<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<!-- generated by https://github.com/cabo/kramdown-rfc version 1.7.19 (Ruby 3.0.2) -->
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902" docName="draft-illyes-repext-02" category="info" consensus="true" submissionType="IETF" tocInclude="true" sortRefs="true" symRefs="true" version="3">
  <!-- xml2rfc v2v3 conversion 3.23.2 -->
  <front>
    <title abbrev="REPext for URI level">Robots Exclusion Protocol Extension for URI Level Control</title>
    <seriesInfo name="Internet-Draft" value="draft-illyes-repext-02"/>
    <author fullname="Gary Illyes">
      <organization>Google LLC.</organization>
      <address>
        <postal>
          <street>Brandschenkestrasse 110</street>
          <city>Zürich</city>
          <code>8002</code>
          <country>Switzerland</country>
        </postal>
        <email>garyillyes@google.com</email>
      </address>
    </author>
    <date year="2024" month="October" day="18"/>
    <keyword>robots.txt</keyword>
    <abstract>
      <?line 46?>

<t>This document extends RFC9309 by specifying additional URI level controls through an application level response header and HTML meta tags, both originally developed in 1996. Additionally, it moves the response header out of the experimental header space (i.e., "X-") and defines how multiple headers can be combined, which was previously not possible.</t>
    </abstract>
    <note removeInRFC="true">
      <name>About This Document</name>
      <t>
        The latest revision of this draft can be found at <eref target="https://garyillyes.github.io/ietf-rep-ext/draft-illyes-repext.html"/>.
        Status information for this document may be found at <eref target="https://datatracker.ietf.org/doc/draft-illyes-repext/"/>.
      </t>
      <t>Source for this draft and an issue tracker can be found at
        <eref target="https://github.com/garyillyes/ietf-rep-ext"/>.</t>
    </note>
  </front>
  <middle>
    <?line 50?>

<section anchor="introduction">
      <name>Introduction</name>
      <t>While the Robots Exclusion Protocol enables service owners to control how, if at all, automated clients known as crawlers may access the URIs on their services as defined by <xref target="RFC8288"/>, the protocol doesn't provide controls on how the data returned by their service may be used upon allowed access.</t>
      <t>That use case is left to URI level controls, originally developed in 1996 and widely adopted since, which are implemented in the response headers or, in case of HTML, in the form of a meta tag. This document specifies these control tags and, in case of the response header field, brings it into compliance with <xref target="RFC9110"/>.</t>
      <t>Application developers are requested to honor these tags. The tags are not a form of access authorization, however.</t>
    </section>
    <section anchor="conventions-and-definitions">
      <name>Conventions and Definitions</name>
      <t>The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL
NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
"<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as
described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they
appear in all capitals, as shown here.</t>
      <?line -18?>

<t>This specification uses the following terms from <xref target="RFC9651"/>: Dictionary, List, String, Parameter.</t>
    </section>
    <section anchor="specification">
      <name>Specification</name>
      <section anchor="robots-control">
        <name>Robots control</name>
        <t>The URI level crawler controls are a key-value pair that can be specified two ways:</t>
        <ul spacing="normal">
          <li>
            <t>an application level response header structured field as specified by <xref target="RFC9651"/>.</t>
          </li>
          <li>
            <t>in case of HTML, one or more meta tags as defined by the HTML specification.</t>
          </li>
        </ul>
        <section anchor="application-layer-response-header">
          <name>Application Layer Response Header</name>
          <t>The application level response header field "robots-tag" is a structured field whose value is a dictionary containing lists of rules applicable to either all accessors or specifically named ones. For historical reasons, implementors should also support the experimental field name, "x-robots-tag".</t>
          <t>The value of the robots-tag field is a dictionary containing lists of rules. The rules are specific to a single product token as defined by <xref target="RFC9309"/> or a global identifier — "*". The global identifier may be omitted. The product token is the first element of each list.</t>
          <t>Duplicate product tokens must be merged and the rules deduplicated.</t>
          <t>For example, the following response header field specifies "noindex" and "nosnippet" rules for all accessors, but specifies no rules for the product token "ExampleBot":</t>
          <artwork><![CDATA[
Robots-Tag: *;noindex;nosnippet, ExampleBot;
]]></artwork>
          <t>The global product identifier "*" in the value may be omitted; for example, this field is equivalent to the previous example:</t>
          <artwork><![CDATA[
Robots-Tag: ;noindex;nosnippet, ExampleBot=;
]]></artwork>
          <t>The structured field in the examples is deserialized into the following objects:</t>
          <artwork><![CDATA[
["*" = [["noindex", true], ["nosnippet", true]],
["ExampleBot" = []]
]]></artwork>
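          <t>As a non-normative illustration, the deserialization above could be sketched in Python as follows. The function name parse_robots_tag is hypothetical. The sketch assumes the simplified syntax used in the examples of this section (members separated by "," and rules separated by ";") and is not a full parser for the Structured Fields grammar of <xref target="RFC9651"/>. Comparing product tokens case-sensitively when merging is also an assumption.</t>
          <sourcecode type="python"><![CDATA[
```python
def parse_robots_tag(field_value):
    """Parse a Robots-Tag field value into {product_token: [rules]}.

    Simplified sketch: splits on "," and ";" only.
    """
    result = {}
    for member in field_value.split(","):
        parts = [p.strip() for p in member.split(";")]
        # The first element names the product token; an omitted token
        # (or a bare "=") stands for the global identifier "*".
        token = parts[0].rstrip("=").strip() or "*"
        # Duplicate product tokens are merged into one rule list.
        rules = result.setdefault(token, [])
        for rule in parts[1:]:
            rule = rule.lower()
            # Skip empty pieces and rules already recorded for this token.
            if rule and rule not in rules:
                rules.append(rule)
    return result
```
]]></sourcecode>
          <t>Under these assumptions, both example field values of this section yield the "noindex" and "nosnippet" rules for "*" and an empty rule list for "ExampleBot".</t>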
          <t>Implementors <bcp14>SHOULD</bcp14> impose a parsing limit on the field value to protect their systems. The parsing limit <bcp14>MUST</bcp14> be at least 8 kibibytes <xref target="KiB"/>.</t>
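          <t>One possible way to apply such a limit is sketched below in Python. The names MIN_PARSING_LIMIT and apply_parsing_limit are hypothetical, and ignoring the bytes beyond the limit is an assumption: this document only states the minimum size of the limit, not how excess data is handled.</t>
          <sourcecode type="python"><![CDATA[
```python
MIN_PARSING_LIMIT = 8 * 1024  # 8 kibibytes, the minimum allowed limit


def apply_parsing_limit(field_value, limit=MIN_PARSING_LIMIT):
    """Return the prefix of field_value (bytes) that will be parsed."""
    if limit < MIN_PARSING_LIMIT:
        raise ValueError("parsing limit must be at least 8 kibibytes")
    # Assumption: bytes beyond the limit are ignored, not rejected.
    return field_value[:limit]
```
]]></sourcecode>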
        </section>
        <section anchor="html-meta-element">
          <name>HTML meta element</name>
          <t>For historical reasons the robots-tag header may be specified by service owners as an HTML meta tag. In case of the meta tag, the name attribute is used to specify the product token, and the content attribute to specify the comma separated robots-tag rules.</t>
          <t>As with the header, the product token may be a global token, "robots", which signifies that the rules apply to all requestors, or a specific product token applicable to a single requestor. For example:</t>
          <artwork><![CDATA[
<meta name="robots" content="noindex">
<meta name="examplebot" content="nosnippet">
]]></artwork>
          <t>Multiple robots meta elements may appear in a single HTML document. Requestors must obey the sum of negative rules specific to their product token and the global product token.</t>
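          <t>The following non-normative Python sketch shows one way a requestor might collect the rules that apply to it from an HTML document. The names RobotsMetaParser and rules_for are hypothetical. The sketch assumes that "the sum of negative rules" means the union of the rules for the requestor's own product token and for the global token "robots", that product tokens are compared case-insensitively, and that only meta elements inside the head element are considered, as recommended in the Security Considerations.</t>
          <sourcecode type="python"><![CDATA[
```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect comma-separated rules from named meta elements in head."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.rules = {}  # lowercased meta name -> set of rules

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = (attributes.get("name") or "").strip().lower()
            content = attributes.get("content") or ""
            if name:
                bucket = self.rules.setdefault(name, set())
                for rule in content.split(","):
                    rule = rule.strip().lower()
                    if rule:
                        bucket.add(rule)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def rules_for(html, product_token):
    """Union of the rules for product_token and the global token."""
    parser = RobotsMetaParser()
    parser.feed(html)
    own = parser.rules.get(product_token.lower(), set())
    return own | parser.rules.get("robots", set())
```
]]></sourcecode>
          <t>For the example document above, a requestor with the product token "examplebot" would obtain both "noindex" and "nosnippet", while any other requestor would obtain only "noindex".</t>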
        </section>
        <section anchor="robots-controls-rules">
          <name>Robots controls rules</name>
          <t>The possible values of the rules are:</t>
          <ul spacing="normal">
            <li>
              <t>noindex - instructs the parser to not store the served data in its publicly accessible index.</t>
            </li>
            <li>
              <t>nosnippet - instructs the parser to not reproduce any stored data as an excerpt snippet.</t>
            </li>
          </ul>
          <t>The values are case-insensitive. Unsupported rules must be ignored.</t>
          <t>Implementors may support other rules as specified in Section 2.2.4 of <xref target="RFC9309"/>.</t>
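          <t>As a non-normative sketch, the handling of rule values described in this section could look as follows in Python. The names SUPPORTED_RULES and normalize_rules are hypothetical. An implementation that supports additional rules would extend the set accordingly.</t>
          <sourcecode type="python"><![CDATA[
```python
SUPPORTED_RULES = frozenset({"noindex", "nosnippet"})


def normalize_rules(raw_rules, supported=SUPPORTED_RULES):
    """Lowercase rule values and drop unsupported or repeated ones."""
    normalized = []
    for rule in raw_rules:
        rule = rule.strip().lower()  # rule values are case-insensitive
        # Unsupported rules are ignored rather than treated as errors.
        if rule in supported and rule not in normalized:
            normalized.append(rule)
    return normalized
```
]]></sourcecode>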
        </section>
        <section anchor="caching-of-values">
          <name>Caching of values</name>
          <t>The rules specified for a specific product token must be obeyed until the rules have changed. Implementors <bcp14>MAY</bcp14> use standard cache control as defined in <xref target="RFC9110"/> for caching robots-tag rules. Implementors <bcp14>SHOULD</bcp14> refresh their caches within a reasonable time frame.</t>
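          <t>A minimal, non-normative Python sketch of such caching is shown below. The class name RobotsTagCache and the 24 hour default lifetime are assumptions made for illustration, since this document does not define what a reasonable time frame is. The sketch only honors the "max-age" directive and ignores every other cache directive of <xref target="RFC9110"/>.</t>
          <sourcecode type="python"><![CDATA[
```python
import re
import time

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # assumption, not taken from the text


class RobotsTagCache:
    """Cache robots-tag rules per URI until their lifetime expires."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # uri -> (expiry time, rules)

    def store(self, uri, rules, cache_control=""):
        # Use max-age when present, else fall back to the default.
        match = re.search(r"max-age=(\d+)", cache_control.lower())
        ttl = int(match.group(1)) if match else DEFAULT_TTL_SECONDS
        self._entries[uri] = (self._clock() + ttl, rules)

    def lookup(self, uri):
        """Return cached rules, or None if they must be refreshed."""
        entry = self._entries.get(uri)
        if entry is None or self._clock() >= entry[0]:
            self._entries.pop(uri, None)
            return None
        return entry[1]
```
]]></sourcecode>
          <t>A return value of None from lookup signals that the rules need to be fetched again before they are relied upon.</t>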
        </section>
      </section>
    </section>
    <section anchor="security-considerations">
      <name>Security Considerations</name>
      <t>The robots-tag is not a substitute for valid content security measures. To control access to the URIs for which robots-tag rules are specified, users of the protocol should employ a valid security measure relevant to the application layer on which the resource is served; for example, in the case of HTTP, HTTP Authentication as defined in <xref target="RFC9110"/>.</t>
      <t>The content of the robots-tag header field is not secure, private or integrity-guaranteed, and due caution should be exercised when using it. Use of Transport Layer Security (TLS) with HTTP (<xref target="RFC9110"/> and <xref target="RFC2817"/>) is currently the only end-to-end way to provide such protection.</t>
      <t>In case of a robots-tag specified in a HTML meta element, implementors should consider only the meta elements specified in the head element of the HTML document, which is generally only accessible to the service owner.</t>
      <t>To protect against memory overflow attacks, implementers should enforce a limit on how much data they will parse; see section N for the lower limit.</t>
    </section>
    <section anchor="iana-considerations">
      <name>IANA Considerations</name>
      <t><tt>
TODO(illyes):
https://www.rfc-editor.org/rfc/rfc9110.html#name-field-name-registry
</tt></t>
    </section>
  </middle>
  <back>
    <references anchor="sec-combined-references">
      <name>References</name>
      <references anchor="sec-normative-references">
        <name>Normative References</name>
        <reference anchor="RFC2817">
          <front>
            <title>Upgrading to TLS Within HTTP/1.1</title>
            <author fullname="R. Khare" initials="R." surname="Khare"/>
            <author fullname="S. Lawrence" initials="S." surname="Lawrence"/>
            <date month="May" year="2000"/>
            <abstract>
              <t>This memo explains how to use the Upgrade mechanism in HTTP/1.1 to initiate Transport Layer Security (TLS) over an existing TCP connection. [STANDARDS-TRACK]</t>
            </abstract>
          </front>
          <seriesInfo name="RFC" value="2817"/>
          <seriesInfo name="DOI" value="10.17487/RFC2817"/>
        </reference>
        <reference anchor="RFC8288">
          <front>
            <title>Web Linking</title>
            <author fullname="M. Nottingham" initials="M." surname="Nottingham"/>
            <date month="October" year="2017"/>
            <abstract>
              <t>This specification defines a model for the relationships between resources on the Web ("links") and the type of those relationships ("link relation types").</t>
              <t>It also defines the serialisation of such links in HTTP headers with the Link header field.</t>
            </abstract>
          </front>
          <seriesInfo name="RFC" value="8288"/>
          <seriesInfo name="DOI" value="10.17487/RFC8288"/>
        </reference>
        <reference anchor="RFC9651">
          <front>
            <title>Structured Field Values for HTTP</title>
            <author fullname="M. Nottingham" initials="M." surname="Nottingham"/>
            <author fullname="P-H. Kamp" surname="P-H. Kamp"/>
            <date month="September" year="2024"/>
            <abstract>
              <t>This document describes a set of data types and associated algorithms that are intended to make it easier and safer to define and handle HTTP header and trailer fields, known as "Structured Fields", "Structured Headers", or "Structured Trailers". It is intended for use by specifications of new HTTP fields.</t>
              <t>This document obsoletes RFC 8941.</t>
            </abstract>
          </front>
          <seriesInfo name="RFC" value="9651"/>
          <seriesInfo name="DOI" value="10.17487/RFC9651"/>
        </reference>
        <reference anchor="RFC9110">
          <front>
            <title>HTTP Semantics</title>
            <author fullname="R. Fielding" initials="R." role="editor" surname="Fielding"/>
            <author fullname="M. Nottingham" initials="M." role="editor" surname="Nottingham"/>
            <author fullname="J. Reschke" initials="J." role="editor" surname="Reschke"/>
            <date month="June" year="2022"/>
            <abstract>
              <t>The Hypertext Transfer Protocol (HTTP) is a stateless application-level protocol for distributed, collaborative, hypertext information systems. This document describes the overall architecture of HTTP, establishes common terminology, and defines aspects of the protocol that are shared by all versions. In this definition are core protocol elements, extensibility mechanisms, and the "http" and "https" Uniform Resource Identifier (URI) schemes.</t>
              <t>This document updates RFC 3864 and obsoletes RFCs 2818, 7231, 7232, 7233, 7235, 7538, 7615, 7694, and portions of 7230.</t>
            </abstract>
          </front>
          <seriesInfo name="STD" value="97"/>
          <seriesInfo name="RFC" value="9110"/>
          <seriesInfo name="DOI" value="10.17487/RFC9110"/>
        </reference>
        <reference anchor="RFC9309">
          <front>
            <title>Robots Exclusion Protocol</title>
            <author fullname="M. Koster" initials="M." surname="Koster"/>
            <author fullname="G. Illyes" initials="G." surname="Illyes"/>
            <author fullname="H. Zeller" initials="H." surname="Zeller"/>
            <author fullname="L. Sassman" initials="L." surname="Sassman"/>
            <date month="September" year="2022"/>
            <abstract>
              <t>This document specifies and extends the "Robots Exclusion Protocol" method originally defined by Martijn Koster in 1994 for service owners to control how content served by their services may be accessed, if at all, by automatic clients known as crawlers. Specifically, it adds definition language for the protocol, instructions for handling errors, and instructions for caching.</t>
            </abstract>
          </front>
          <seriesInfo name="RFC" value="9309"/>
          <seriesInfo name="DOI" value="10.17487/RFC9309"/>
        </reference>
        <reference anchor="RFC2119">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
            <abstract>
              <t>In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>
        <reference anchor="RFC8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author fullname="B. Leiba" initials="B." surname="Leiba"/>
            <date month="May" year="2017"/>
            <abstract>
              <t>RFC 2119 specifies common key words that may be used in protocol specifications. This document aims to reduce the ambiguity by clarifying that only UPPERCASE usage of the key words have the defined special meanings.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>
      </references>
      <references anchor="sec-informative-references">
        <name>Informative References</name>
        <reference anchor="KiB" target="https://simple.wikipedia.org/wiki/Kibibyte">
          <front>
            <title>KibiByte</title>
            <author>
              <organization/>
            </author>
            <date year="2022" month="October" day="14"/>
          </front>
        </reference>
      </references>
    </references>
    <?line 168?>

<section numbered="false" anchor="acknowledgments">
      <name>Acknowledgments</name>
      <t>TODO acknowledge.</t>
    </section>
  </back>

</rfc>
