# json-ld-example.json
                                                                                
{
  "microdata": [
    {
      "@context": "https://schema.org",
      "@type": "WebPage",
      "author": {
        "@type": "Person",
        "id": "https://seirdy.one/#seirdy",
        "image": "https://seirdy.one/favicon.1250396055.png",
        "name": "Seirdy",
        "url": "https://seirdy.one/"
      },
      "breadcrumb": {
        "@type": "BreadcrumbList",
        "itemListElement": [
          {
            "@type": "ListItem",
            "item": "https://seirdy.one/posts/",
            "name": "Articles",
            "position": "1"
          },
          {
            "@type": "ListItem",
            "item": "https://seirdy.one/posts/2024/05/30/google-document-warehouse-api-docs-leak/",
            "name": "Takeaways from the Google Content Warehouse API documentation leak",
            "position": "2"
          }
        ]
      },
      "copyrightHolder": {
        "@type": "Person",
        "id": "https://seirdy.one/#seirdy",
        "image": "https://seirdy.one/favicon.1250396055.png",
        "name": "Seirdy",
        "url": "https://seirdy.one/"
      },
      "copyrightYear": "2024",
      "hasPart": [
        {
          "@type": "SiteNavigationElement",
          "name": "Articles",
          "url": "https://seirdy.one/posts/"
        },
        {
          "@type": "SiteNavigationElement",
          "name": "Notes",
          "url": "https://seirdy.one/notes/"
        },
        {
          "@type": "SiteNavigationElement",
          "name": "Bookmarks",
          "url": "https://seirdy.one/bookmarks/"
        },
        {
          "@type": "SiteNavigationElement",
          "name": "About",
          "url": "https://seirdy.one/about/"
        },
        {
          "@type": "SiteNavigationElement",
          "name": "Meta",
          "url": "https://seirdy.one/meta/"
        },
        {
          "@type": "SiteNavigationElement",
          "name": "Support",
          "url": "https://seirdy.one/support/"
        }
      ],
      "isPartOf": {
        "@type": [
          "https://schema.org/Blog",
          "https://schema.org/WebSite"
        ],
        "id": "https://seirdy.one/",
        "name": "Seirdy’s Home",
        "url": "https://seirdy.one/"
      },
      "license": {
        "@type": "CreativeWork",
        "name": "CC BY-SA 4.0",
        "url": "https://creativecommons.org/licenses/by-sa/4.0/"
      },
      "mainEntity": {
        "@type": "BlogPosting",
        "articleBody": "Introduction\n\nIn March, the official Elixir client for Google APIs received an accidental commit for internal non-public APIs. The commit added support for Google’s Content Warehouse API, which includes Google’s 14,000+ search ranking factors. Oops! Some people noticed this after its redaction earlier this month, and the news broke on May 28. You can read through the Content Warehouse API reference on HexDocs. I skimmed through these and read some blog posts by others who looked more deeply.\n\nIn particular, I referenced Secrets from the Algorithm: Google Search’s Internal Engineering Documentation Has Leaked by Mike King. Note that Mike King’s article doubles as an advertisement for his company’s services and for the legitimacy of search engine optimization (SEO) companies in general. I don’t endorse that message. I disagree with some of its claims, and elaborate on them in the coming sections. That said, I found the article well-researched. It cross-references information against other leaks, too.\n\nThoughts on individual ranking factors\n\nPermalink to section\n\nGoogle has over 14,000 ranking factors. I have not and will not read them all. I went through what other bloggers found notable, the PerDocData page, and what looked interesting when I searched for keywords I thought would reveal important ranking factors.\n\nSmall personal sites and commercial sites\n\nGoogle determines if your site is a small personal site note 1 and calculates a commercialScore in PerDocData which indicates [the] document is commercial (i.e. sells something). The docs have no information about whether either signal is positive or negative. Given how Google results look today and the language it uses in its documentation for manual reviewers, note 2 I conclude that personal sites don’t receive a significant boost. If anything, they may be demoted instead.\n\nI feel disappointed. 
I always considered the bias against small sites unintentionally emergent from them having no SEO budget. If a solution already exists, why doesn’t Google use it to even this gap? A more optimistic interpretation is that this factor will have weight when it’s ready and resistant to manipulation, but I don’t see incentives lining up to make that happen.\n\nFont size‽\n\nGoogle tracks “weighted font size” to notice key terms. Separation of content/semantics and form/presentation is baked into the DNA of the Web. Google should stick to semantic HTML elements such as <dfn> and <dt>, or at least <strong> and <em>.\n\nI worry that people will interpret this piece of API documentation as advice and run with it. Search engines have the power to incentivize good behavior, and this piece of information has the opposite effect. Visual emphasis should derive from semantic meaning, dammit!\n\nThis might have no weight in production. Perhaps Google uses the font size factor during A/B testing, comparing how results change when considering both styling and semantics. Google tracking something isn’t evidence that Google uses it in production. A closer read of the docs shows Google tracking ten font metrics, and I don’t believe that attributes such as medianLineSpan and fontId are ranking factors. It’s still plausible that font size impacts ranking since Google does track font size separately as an attribute of anchor text. note 3\n\nChrome user data\n\nGoogle uses Chrome and click data, much like how Brave Search uses Brave data. note 4 I don’t like this, as it lends itself to clickbait and chasing engagement rather than actual quality. At least, unlike Brave, Google doesn’t measure clicks on competitor engines. This contradicts many official docs and spokespeople. I would put a disclaimer like the one in the earlier section, but Mike King cross-referenced this against other leaks that confirm as much. 
Plausible deniability seems low.\n\nManual review\n\ngolden (type: boolean(), default: nil) - Flag for indicating that the document is a gold-standard document. This can be used for putting additional weight on human-labeled documents in contrast to automatically labeled annotations.\n\n— Google Content Warehouse API documentation: GoogleApi.ContentWarehouse.V1.Model.NlpSaftDocument\n\nThe existence of manual review to evaluate Google’s ranking has never been secret, but evidence that manually reviewed documents can have a ranking adjustment is new.\n\nManual ranking can combine with modifications to your ranking algorithm to bias your centrality algorithm around handpicked pages, which is how Marginalia achieves its anti-SEO bias. note 5 Personalized PageRank is one such algorithm, documented in the original PageRank paper. I like the use of manual review for “gold-standard documents” when applied to centrality algorithm biasing. However, I don’t know how I feel about manual reviewer scores directly appearing as a result ranking factor.\n\nLike font size, we don’t know whether Google actually uses manual review in production ranking factors. Google might catalog it here to run tests of expected-versus-actual ranking. Directly or indirectly, it shows that Google does take manual reviews into account in some way.\n\nBias against new sites\n\nIt’s not just you. Google has a bias against new sites due to their spam potential. note 6 Contrary to what official statements say, Google has a “sandbox” for new sites. Google also uses domain registration information. note 7 Mike King’s post says this comes from Google Domains itself, but I haven’t found evidence to back this up. Current domain registration records are public. An organization such as Google can use them to build a catalog of historical registration information without tapping into its domain registry. 
Anybody with whois can do this!\n\nTruncation\n\nGoogle does truncate pages to a certain number of tokens, note 8 like most engines, instead of reading long pages indefinitely. I find this strange: based on keyword matches, I’m sure Google has read to the end of some of my longest blog posts. Some fill almost 100 pages printed out (yeah…I have a problem). Google uses a limited number of historic versions of pages, note 9 so this isn’t due to historical versions of my page. Perhaps the token limit is just that high.\n\nAuthor name mismatches\n\nGoogle extracts the same piece of metadata (e.g., published/updated timestamps or author names) from wherever it exists (the URL, byline, natural-language processing, structured data, the sitemap, etc.). For authors, it does seem to care about mismatches. Public documentation allows an author entity to have many names, and this factor doesn’t necessarily contradict that. I imagine that ensuring author name consistency could create bias against people who do specify different authors in different parts of the same page (plural systems come to mind), especially when we consider false positives. I’m uncertain; this is speculation on my part.\n\nA cold shower: this isn’t as significant as some SEOs claim\n\nPermalink to section\n\nWe only have API documentation. We don’t know about any hidden knowledge, whether any of these factors have a ranking weight of “zero”, whether any of these conditionally apply, which are only used internally for testing, etc. As I said in prior disclaimers, some factors might exist for testing purposes. Serious conclusions drawn from this leak are, to some degree, speculation.\n\nI wouldn’t panic over how SEO companies use this leak to game the algorithm and ruin search more. Given their track record of missing the forest for the trees and the ever-changing hidden weighting factors we can’t see, we have little reason for concern. 
I imagine certain people in the SEO industry jumping to conclusions based on word choice in these API docs, not realizing how words’ original legacy meanings and current meanings are different.\n\nFor example, per-page metadata includes integer attributes such as crawlerPageRank and pagerank2, but PageRank is no longer a useful way to build a ranking algorithm for the entire Web. The attribute might no longer carry weight, or the decades-old PageRank centrality algorithm might not populate this anymore. To put this in perspective, the docs mention a HtmlrenderWebkitHeadlessProto but Google’s known to use a Chromium-based browser to render pages. Chromium hasn’t used WebKit in a decade; it hard-forked WebKit to make Blink in 2013.\n\nPer-page metadata also includes a toolbarPagerank integer attribute that hearkens back to the ancient Toolbar PageRank; this also probably doesn’t carry weight today. You can read more about Google’s use of PageRank and Toolbar in RIP Google PageRank score: A retrospective on how it ruined the web by Danny Sullivan.\n\nConclusion: my takeaways\n\nPermalink to section\n\nI still despise how the SEO industry and Google have started an arms race to incentivize making websites worse for actual users, selecting against small independent websites. I do maintain that we can carve out a non-toxic sliver of SEO: “search engine compatibility”. Few features uniquely belong in search engine, browser, reading mode, feed reader, social media link-preview, etc. compatibility. If you specifically ignore search engine compatibility but target everything else, you’ll implement it regardless. I call this principle “agent optimization”. I prefer the idea of optimizing for generic agents to optimizing for search engines, let alone one (1) search engine, in isolation. Naturally, user-agents (including browsers) come first; nothing should have significant conflict with them.\n\nIf you came to this article as an SEO, I don’t think I can convince you to stop. 
Instead, remember that it’s easy to miss the forest for the trees. Don’t lose sleep over one in fourteen thousand ranking criteria without other data backing up its importance and current relevance.\n\nConsider my rule of thumb, whose relevance will outlast this leak: assume Google looks at whatever information it can if it helps Google draw the conclusions its public guidelines say it tries to draw, even if those guidelines say it doesn’t use that information. The information Google uses differs from what it tells the public (yes, Google lied), and changes with time; however, Google’s intent makes for less of a moving target. This leak might contradict how Google determines what it should rank well, but not what it looks for. A good reference for what Google looks for is Google’s search rater guidelines for manual reviewers.\n\nGoogle lied, but don’t uncritically fall for the coming SEO hype.\n\nFootnotes\n\nSee the smallPersonalSite attribute of QualityNsrNsrData.\n\nBack to reference 1\n\nSee the conclusion, or snippets of the Google Search Central documentation such as this page describing the EEAT principle: experience, expertise, authoritativeness, and trustworthiness.\n\nBack to reference 2\n\nAnchorsAnchor has a fontSize member with no extra documentation.\n\nBack to reference 3\n\nI’d always assumed (in private, due to a lack of evidence) that the Chrome User Experience Report (CrUX) played a role in search rankings. I don’t know if or how this data overlaps with CrUX.\n\nBack to reference 4\n\nThe creator of Marginalia documents initial experiments in a 2021 blog post, and later confirmed this on “Hacker” “News”. In 2023, Marginalia switched away from PageRank to a different centrality algorithm.\n\nBack to reference 5\n\nSee the hostAge attribute of PerDocData.\n\nBack to reference 6\n\nSee RegistrationInfo. 
It defines createdDate and expiredDate attributes.\n\nBack to reference 7\n\nSee docs for numTokens in DocProperties: we drop some tokens in mustang and also truncate docs at a max cap.\n\nBack to reference 8\n\nSee the urlHistory attribute of CompositeDocIndexingInfo.\n\nBack to reference 9",
        "author": {
          "@type": "Person",
          "id": "https://seirdy.one/#seirdy",
          "image": "https://seirdy.one/favicon.1250396055.png",
          "name": "Seirdy",
          "url": "https://seirdy.one/"
        },
        "backstory": "Introduction\n\nIn March, the official Elixir client for Google APIs received an accidental commit for internal non-public APIs. The commit added support for Google’s Content Warehouse API, which includes Google’s 14,000+ search ranking factors. Oops! Some people noticed this after its redaction earlier this month, and the news broke on May 28. You can read through the Content Warehouse API reference on HexDocs. I skimmed through these and read some blog posts by others who looked more deeply.\n\nIn particular, I referenced Secrets from the Algorithm: Google Search’s Internal Engineering Documentation Has Leaked by Mike King. Note that Mike King’s article doubles as an advertisement for his company’s services and for the legitimacy of search engine optimization (SEO) companies in general. I don’t endorse that message. I disagree with some of its claims, and elaborate on them in the coming sections. That said, I found the article well-researched. It cross-references information against other leaks, too.",
        "comment": {
          "@type": "Comment",
          "accessibilitySummary": "This comment may have major formatting errors that could impact screen reader comprehension.",
          "author": {
            "@type": "Person",
            "name": "Seirdy"
          },
          "datePublished": "2024-08-02 04:05:13Z",
          "name": "wow i sure hope nobody notices i got the title wrong and edited it after posting",
          "text": "wow i sure hope nobody notices i got the title wrong and edited it after posting",
          "url": "https://brid.gy/comment/mastodon/@seirdy@pleroma.envs.net/AiPaH6gN6VboAEbG5I/AiPfTjjAGEH8VQkR3A"
        },
        "copyrightHolder": {
          "@type": "Person",
          "id": "https://seirdy.one/#seirdy",
          "image": "https://seirdy.one/favicon.1250396055.png",
          "name": "Seirdy",
          "url": "https://seirdy.one/"
        },
        "dateCreated": "2024-05-30T08:47:38Z",
        "dateModified": "2024-05-30T09:38:59Z",
        "datePublished": "2024-05-30T08:47:38Z",
        "discussionUrl": [
          "https://pleroma.envs.net/notice/AiPaH6gN6VboAEbG5I",
          "https://community.mojeek.com/t/takeaways-from-the-google-cloud-warehouse-api-documentation-leak/1079",
          "https://www.jstpst.net/f/just_post/10039/takeaways-from-the-google-cloud-warehouse-api-documentation"
        ],
        "headline": "Takeaways from the Google Content Warehouse API documentation leak",
        "id": "https://seirdy.one/posts/2024/05/30/google-document-warehouse-api-docs-leak/",
        "isPartOf": "https://seirdy.one/",
        "mentions": [
          {
            "@type": "BlogPosting",
            "author": {
              "@type": "Person",
              "familyName": "King",
              "givenName": "Mike",
              "name": "Mike King",
              "url": "https://ipullrank.com/author/ipullrank"
            },
            "name": "Secrets from the Algorithm: Google Search’s Internal Engineering Documentation Has Leaked",
            "url": "https://ipullrank.com/google-algo-leak"
          },
          {
            "@type": "Quotation",
            "citation": "Google Content Warehouse API documentation: GoogleApi.ContentWarehouse.V1.Model.NlpSaftDocument",
            "isPartOf": {
              "@type": "APIReference",
              "headline": "Google Content Warehouse API documentation: GoogleApi.ContentWarehouse.V1.Model.NlpSaftDocument",
              "name": "Google Content Warehouse API documentation: GoogleApi.ContentWarehouse.V1.Model.NlpSaftDocument",
              "url": "https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.NlpSaftDocument.html#module-attributes"
            },
            "text": "golden (type: boolean(), default: nil) - Flag for indicating that the document is a gold-standard document. This can be used for putting additional weight on human-labeled documents in contrast to automatically labeled annotations."
          },
          {
            "@type": "NewsArticle",
            "author": {
              "@type": "Person",
              "familyName": "Sullivan",
              "givenName": "Danny",
              "name": "Danny Sullivan",
              "url": "https://dannysullivan.com/"
            },
            "headline": "RIP Google PageRank score: A retrospective on how it ruined the web",
            "name": "RIP Google PageRank score: A retrospective on how it ruined the web",
            "url": "https://searchengineland.com/rip-google-pagerank-retrospective-244286"
          }
        ],
        "name": "Takeaways from the Google Content Warehouse API documentation leak",
        "potentialAction": {
          "@type": "CommentAction",
          "actionStatus": "PotentialActionStatus"
        },
        "timeRequired": "PT9M",
        "url": "https://seirdy.one/posts/2024/05/30/google-document-warehouse-api-docs-leak/",
        "wordCount": "1768"
      }
    }
  ],
  "status": "200 OK",
  "url": "https://seirdy.one/posts/2024/05/30/google-document-warehouse-api-docs-leak/"
}
# html-example.html
                                                                                
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us" xml:lang="en-us" prefix="og: https://ogp.me/ns# article: https://ogp.me/ns/article# cc: http://creativecommons.org/ns#">
<head>
	<meta charset="UTF-8"/>
	<meta name="disabled-adaptations" content="watch"/>
	<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"/>
	<meta name="robots" content="index,follow,max-image-preview:large,max-snippet:-1,noai,noimageai,nocache"/>
	<style><!--/*--><![CDATA[/*><!--*/html{font:100%/1.5 sans-serif;overflow-y:scroll;-webkit-text-size-adjust:none;text-size-adjust:none}@media screen{body{margin:auto;max-width:40em;padding:0 14px}.e-content,[itemprop=dataFeedElement],.narrow{margin:auto;max-width:34em}body>:not(main),main>:not(article),li article,article>:not(h2):not(h3){contain:inline-size layout paint;padding:0 .5em}main>h1{padding-left:.25em}article>hr,body>hr,main>hr{margin:0 .5em}li .p-name+p,header hr{margin-bottom:0}h1{margin:0 0 .25em}dt,footer,h2,h3,li article,summary,[role=doc-endnotes]{content-visibility:auto;contain-intrinsic-size:auto 3em}dt,h3{contain-intrinsic-size:1.5em}footer,li article{contain-intrinsic-size:auto 16em}li article[itemtype="https://schema.org/SocialMediaPosting"]{contain-intrinsic-size:auto 36em}.tall,[role=doc-endnotes]{contain-intrinsic-size:auto 50em}article,body,dt,dd,h1,h2,h3,main,pre,summary,[role=doc-endnotes],[role=doc-preface]{contain:inline-size layout paint}figure,:not(li)>p{contain:inline-size layout}article>h2{margin:.25em 0;padding:.25em 0}details,fieldset,form{margin:.5em 0}input,summary,aside>a,dt>a,:not(h1)+ul>li>a,ol>li>a,nav li>a,.u-comment dd>a,[itemprop=breadcrumb] a,[itemprop=breadcrumb]>span{padding:.75em .25em}dt{padding:1em .5em;margin:-.25em 0 -.25em -.5em}dd{margin:0;padding:.25em .25em .5em 1.75em}aside>a,dt>a{contain:content;margin:-.75em -.25em}h2+aside[role=none]{contain:strict;content-visibility:auto;height:1.75em;margin:-1em -.5em;padding:1em .5em}header>nav,a[href="#h1"],.u-comment dd>a,footer>nav,li>a,aside>a,nav ol a{display:inline-block;margin-left:-.25em}h1+ul a{margin-left:0}h2>a{contain:content;display:inline-block;margin:0 .125em;padding:.25em}h3>a{contain:content;display:inline-block;padding:.5em .25em}article>h3{padding:.25em;margin:0 0 0 -.5em}[role=doc-backlink],section article p{margin-left:-.5em}header>nav,nav[itemprop=breadcrumb]{padding:.75em 0 
.25em}dt+dt{padding-top:.75em;margin-top:-.75em}dt+dt>a{padding-top:0}:not(nav)>:not(h1)+ul li>a,nav:not([itemprop=breadcrumb]) li,ol li>a{margin:.25em}[role=doc-backlink]{contain:content;display:inline-block;padding:.75em .5em;margin-top:-1em}a[href="#h1"]{contain:content;content-visibility:auto;padding:0 .25em;position:absolute;top:-2em}a[href="#h1"]:focus{top:0}}sup>a{margin-left:.25em;padding-bottom:.5em}sup{font-size:.85em;line-height:0}ol,ul,li h2+ul{padding-left:1.75em}blockquote,ol ol,ul ul{-webkit-hyphens:auto;hyphens:auto;margin:0;padding-left:1.25em}nav ul{margin:0;padding:0}[itemprop=breadcrumb] ol,[itemprop=breadcrumb] li,[itemprop=breadcrumb]>span,nav ul li,dt>a{display:inline-block}[itemprop=breadcrumb] ol{margin:-.25em;padding:0}[itemprop=breadcrumb] li:not(:last-of-type)::after{content:"→"}blockquote{border-left:3px solid}h1{-webkit-hyphens:auto;hyphens:auto}@media(max-width:272px){body{-webkit-hyphens:auto;hyphens:auto;padding:0 6px}li>a,[itemprop=breadcrumb] a,[itemprop=breadcrumb]>span{padding:.25em}dd{padding-left:1em}hr{margin:.25em 0}h2+aside[role=none]{contain:inline-size layout paint}}kbd{font-weight:700}ins,[role=note],[role=doc-tip]{contain:content;font-style:italic;text-decoration:none}figure,section[itemprop=mentions]{margin:1.5em 0}figure[itemtype="https://schema.org/ImageObject"]{margin:1.5em}section[itemprop=mentions]>figure{margin:0}code,kbd,pre,samp{font-family:monospace,monospace}[hidden],[type=hidden]{display:none}.h-feed>ol{list-style-type:none;margin:0;padding:0}.u-comment,:not(pre)>code,:not(pre)>samp,span[itemtype="https://schema.org/Person"]{overflow-wrap:break-word}pre{overflow:auto;padding:.5em}input,img,mark,pre,summary{border:thin solid}details,:not(pre)>code,:not(pre)>samp{border:thin solid #999;padding:0 .25em}summary{margin:0 -.25em}.e-content img{display:block;height:auto;margin:auto;max-width:100%}.h-card .u-photo{height:1em;width:1em;vertical-align:-.1em}.p-author a.u-uid{text-decoration:none}a 
.u-photo+.p-name{text-decoration:underline}audio{width:100%}.pix{image-rendering:pixelated}legend,form>div{display:table;width:100%}input{font-family:sans-serif;font-size:inherit}input:not([type=submit]){display:table-cell;width:98%}form>div>div{display:table-cell;vertical-align:top;width:1%}a:focus,summary:focus,[tabindex="0"]:focus,form :focus{outline:3px solid}@supports selector(:focus-visible){a:focus:not(:focus-visible),[tabindex="0"]:focus:not(:focus-visible){outline:none}}@media(prefers-color-scheme:dark){button,html,input{background-color:#191919;color:#e6e6e6}mark{color:#000;background-color:#eee8a7}a:link{color:#eee8a7}a:visited{color:#ffd3ff}@media not (prefers-contrast){sup a:link:not(:active){color:#feb}sup a:visited:not(:active){color:#ffe6ff}}@media(prefers-contrast:less){html,input{background-color:#444}}@media(prefers-contrast:more){html,input{background-color:#0d0d0d;color:#f3f3f3}a:link{color:#fff970}a:visited{color:#ccfdff}}a:active{color:#f83}}@media print{summary{list-style:none}#toc,[href="#h1"],[role=doc-backlink],aside:not([role=note]),article summary,section[aria-labelledby=webmentions],footer,body>hr,main[itemprop]>article+hr,nav:not([itemprop=breadcrumb]) a:not([rel=home]){display:none}[role=note] p,[role=doc-tip] p{margin:.25em 0}}figure,blockquote,section[itemprop=mentions],li{break-inside:avoid}/*]]>*/--></style>
	<link href="https://seirdy.one/posts/2024/05/30/google-document-warehouse-api-docs-leak/" rel="canonical"/>
	<link href="https://collector.seirdy.one/webmentions/receive" rel="webmention"/>
	<link href="https://webmention.io/webmention?forward=https://collector.seirdy.one/webmentions/receive" rel="pingback"/>
	<link rel="authorization_endpoint" href="https://indieauth.com/auth"/>
	<link href="/manifest.1941423154.webmanifest" rel="manifest"/>
	<link rel="alternate" type="application/atom+xml" href="https://seirdy.one/posts/atom.xml" title="Articles"/>
	<link rel="alternate" type="application/atom+xml" href="https://seirdy.one/atom.xml" title="All content"/>
	<link rel="alternate" type="application/atom+xml" href="https://seirdy.one/notes/atom.xml" title="Notes"/>
	<title>Takeaways from the Google Content Warehouse API documentation leak</title>
	<meta name="description" content="My thoughts on Google's Content Warehouse API doc leak, what we can learn from its ranking factors, and why the following SEO hype is overblown."/>
	<meta name="author" content="Seirdy"/>
	<meta name="fediverse:creator" content="@Seirdy@pleroma.envs.net"/>
	<meta property="article:author" content="Seirdy"/>
	<meta property="article:published_time" content="2024-05-30T08:47:38Z"/>
	<meta property="article:modified_time" content="2024-05-30T09:38:59Z"/>
	<link rel="icon" sizes="any" href="/favicon.2229316949.svg" type="image/svg+xml"/>
	<link rel="icon" sizes="192x192" href="/favicon192.3669199476.png" type="image/png"/>
	<meta name="color-scheme" content="light dark"/>
	<meta name="format-detection" content="telephone=no"/>
	<meta name="theme-color" content="#191919" media="(prefers-color-scheme:dark)"/>
	<meta name="theme-color" content="#fff" media="(prefers-color-scheme:light)"/>
	<meta property="og:title" content="Takeaways from the Google Content Warehouse API documentation leak"/>
	<meta property="og:site_name" content="Seirdy’s Home"/>
	<meta property="og:type" content="article"/>
	<meta property="og:image" content="https://seirdy.one/favicon512.2364016307.png"/>
	<meta property="og:image:type" content="image/png"/>
	<meta property="og:image:height" content="512"/>
	<meta property="og:image:width" content="512"/>
	<meta property="og:url" content="https://seirdy.one/posts/2024/05/30/google-document-warehouse-api-docs-leak/"/>
	<meta property="og:description" content="My thoughts on Google's Content Warehouse API doc leak, what we can learn from its ranking factors, and why the following SEO hype is overblown."/>
	<meta name="generator" content="Hugo 0.132.0-DEV"/>
</head>
<body itemscope="" itemtype="https://schema.org/WebPage">
	<header>
		<a href="#h1">Skip to content</a>
		<nav aria-label="Global">
			<ul>
				<li itemprop="isPartOf" itemscope="" itemtype="https://schema.org/Blog https://schema.org/WebSite" itemid="https://seirdy.one/">
					<a rel="home" itemprop="url" href="https://seirdy.one/">
						<span itemprop="name">Seirdy’s Home</span>
					</a>
				</li>
				<li itemprop="hasPart" itemscope="" itemtype="https://schema.org/SiteNavigationElement">
					<a href="https://seirdy.one/posts/" itemprop="url" rel="feed">
						<strong itemprop="name">Articles</strong>
					</a>
				</li>
				<li itemprop="hasPart" itemscope="" itemtype="https://schema.org/SiteNavigationElement">
					<a href="https://seirdy.one/notes/" itemprop="url" rel="feed">
						<span itemprop="name">Notes</span>
					</a>
				</li>
				<li itemprop="hasPart" itemscope="" itemtype="https://schema.org/SiteNavigationElement">
					<a href="https://seirdy.one/bookmarks/" itemprop="url" rel="feed">
						<span itemprop="name">Bookmarks</span>
					</a>
				</li>
				<li itemprop="hasPart" itemscope="" itemtype="https://schema.org/SiteNavigationElement">
					<a href="https://seirdy.one/about/" itemprop="url">
						<span itemprop="name">About</span>
					</a>
				</li>
				<li itemprop="hasPart" itemscope="" itemtype="https://schema.org/SiteNavigationElement">
					<a href="https://seirdy.one/meta/" itemprop="url">
						<span itemprop="name">Meta</span>
					</a>
				</li>
				<li itemprop="hasPart" itemscope="" itemtype="https://schema.org/SiteNavigationElement">
					<a href="https://seirdy.one/support/" itemprop="url">
						<span itemprop="name">Support</span>
					</a>
				</li>
			</ul>
		</nav>
	</header>
	<main itemprop="mainEntity" itemscope="" itemtype="https://schema.org/BlogPosting" itemid="https://seirdy.one/posts/2024/05/30/google-document-warehouse-api-docs-leak/">
		<link itemprop="isPartOf" href="https://seirdy.one/"/>
		<article class="h-entry hentry">
			<header>
				<h1 itemprop="name headline" class="p-name entry-title" id="h1" tabindex="-1">Takeaways from the Google Content Warehouse API documentation leak</h1>
				<ul>
					<li>Posted <time itemprop="dateCreated datePublished" class="dt-published published" datetime="2024-05-30T08:47:38Z">2024-05-30</time> by <span itemprop="author copyrightHolder" itemscope="" itemtype="https://schema.org/Person" itemid="https://seirdy.one/#seirdy" class="h-card p-author author vcard"><a itemprop="url" href="https://seirdy.one/" rel="author me home cc:attributionURL" class="u-url u-uid url" property="cc:attributionName"><img itemprop="image" width="16" height="16" alt="" src="https://seirdy.one/favicon.1250396055.png" class="u-photo photo"/> <span itemprop="name" class="p-name p-nickname nickname fn">Seirdy</span></a></span> on his <a rel="bookmark" itemprop="url" class="u-url url" href="https://seirdy.one/posts/2024/05/30/google-document-warehouse-api-docs-leak/">Website</a> and <a rel="syndication" class="u-syndication" href="gemini://seirdy.one/posts/2024/05/30/google-document-warehouse-api-docs-leak/index.gmi">Gemini capsule</a>.
</li>
					<li>
		Last updated <time itemprop="dateModified" class="dt-updated updated" datetime="2024-05-30T09:38:59Z">2024-05-30</time>. <a href="https://git.sr.ht/~seirdy/seirdy.one/log/master/item/content/posts/google-document-warehouse-api-docs-leak.md">Changelog</a>
</li>
					<li><data itemprop="wordCount" value="1768">1768</data> words; a short <time itemprop="timeRequired" datetime="PT9M">9-minute</time> read</li>
				</ul>
			</header>
			<hr/>
			<div class="e-content entry-content" itemprop="articleBody">
				<section role="doc-introduction" itemprop="backstory">
					<h2 id="Introduction">Introduction</h2>
					<p>In March, the official Elixir client for Google APIs <a href="https://github.com/googleapis/elixir-google-api/commit/d7a637f4391b2174a2cf43ee11e6577a204a161e">received an accidental commit for internal non-public APIs</a>. The commit added support for Google’s Content Warehouse API, which includes Google’s 14,000+ search ranking factors. Oops! Some people noticed this after its redaction earlier this month, and the news broke on May 28. You can read through the <a href="https://hexdocs.pm/google_api_content_warehouse/0.4.0/api-reference.html">Content Warehouse API reference on HexDocs</a>. I skimmed through these and read some blog posts by others who looked more deeply.</p>
					<p>In particular, I referenced <span class="h-cite" itemprop="mentions" itemscope="" itemtype="https://schema.org/BlogPosting"><cite itemprop="name" class="p-name"><a class="u-url" itemprop="url" href="https://ipullrank.com/google-algo-leak">Secrets from the Algorithm: Google Search’s Internal Engineering Documentation Has Leaked</a></cite> by <span itemprop="author" itemscope="" itemtype="https://schema.org/Person" class="h-card vcard p-author"><a itemprop="url" href="https://ipullrank.com/author/ipullrank" class="u-url url"><span itemprop="name" class="p-name fn n"><span itemprop="givenName" class="p-given-name given-name">Mike</span>&#160;<span itemprop="familyName" class="p-family-name family-name">King</span></span></a></span></span>. Note that Mike King’s article doubles as an advertisement for his company’s services and for the legitimacy of search engine optimization (<abbr>SEO</abbr>) companies in general. I don’t endorse that message. I disagree with some of its claims, and elaborate on them in the coming sections. That said, I found the article well-researched. It cross-references information against other leaks, too.</p>
				</section>
				<h2 id="thoughts-on-individual-ranking-factors" tabindex="-1">Thoughts on individual ranking factors</h2>
				<aside role="none">
					<a href="#thoughts-on-individual-ranking-factors" aria-labelledby="thoughts-on-individual-ranking-factors-prefix thoughts-on-individual-ranking-factors">
						<span id="thoughts-on-individual-ranking-factors-prefix">Permalink to section</span>
					</a>
				</aside>
				<p>Google has over 14,000 ranking factors. I have not and will not read them all. I went through what other bloggers found notable, the <a href="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.PerDocData.html"><code>PerDocData</code> page</a>, and what looked interesting when I searched for keywords I thought would reveal important ranking factors.</p>
				<h3 id="small-personal-sites-and-commercial-sites" tabindex="-1">Small personal sites and commercial sites</h3>
				<p>Google determines if your site is a <q cite="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.QualityNsrNsrData.html">small personal site</q><sup><a href="#fn:1" id="fnref:1" role="doc-noteref">note 1</a></sup> and calculates a <code>commercialScore</code> in <code>PerDocData</code> which <q cite="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.PerDocData.html#module-attributes">indicates [the] document is commercial (i.e. sells something)</q>. The docs have no information about whether either signal is positive or negative. Given how Google results look today and the language it uses in its documentation for manual reviewers,<sup><a href="#fn:2" id="fnref:2" role="doc-noteref">note 2</a></sup> I conclude that personal sites don’t receive a significant boost. If anything, they may be demoted instead.</p>
				<p>I feel disappointed. I always considered the bias against small sites an unintentional, emergent result of their having no SEO budget. If a solution already exists, why doesn’t Google use it to close this gap? A more optimistic interpretation is that this factor will have weight when it’s ready and resistant to manipulation, but I don’t see incentives lining up to make that happen.</p>
				<h3 id="font-size" tabindex="-1">Font size‽</h3>
				<p>Google tracks “weighted font size” to notice key terms. Separation of content/semantics and form/presentation is baked into the DNA of the Web. Google should stick to semantic HTML elements such as <code>&lt;dfn&gt;</code> and <code>&lt;dt&gt;</code>, or at <em>least</em> <code>&lt;strong&gt;</code> and <code>&lt;em&gt;</code>.</p>
				<p>I worry that people will interpret this piece of API documentation as advice and run with it. <a href="/notes/2022/08/02/accessibility-and-search-indexes/">Search engines have the power to incentivize good behavior</a>, and this piece of information has the opposite effect. Visual emphasis should derive from semantic meaning, dammit!</p>
				<p>This might have no weight in production. Perhaps Google uses the font size factor during A/B testing, comparing how results change when considering both styling and semantics. Google <em>tracking</em> something isn’t evidence that Google uses it in production. <a href="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.GoodocFontSizeStats.html">A closer read of the docs shows Google tracking ten font metrics</a>, and I don’t believe that attributes such as <code>medianLineSpan</code> and <code>fontId</code> are ranking factors. It’s still plausible that font size impacts ranking since Google does track font size separately as an attribute of anchor text.<sup><a href="#fn:3" id="fnref:3" role="doc-noteref">note 3</a></sup></p>
				<h3 id="chrome-user-data" tabindex="-1">Chrome user data</h3>
				<p>Google uses Chrome and click data, much like how Brave Search uses Brave data.<sup><a href="#fn:4" id="fnref:4" role="doc-noteref">note 4</a></sup> I don’t like this, as it lends itself to clickbait and chasing engagement rather than actual quality. At least, unlike Brave, Google doesn’t measure clicks on <em>competitor engines.</em> Google’s use of Chrome and click data contradicts <em>many</em> official docs and spokespeople. I would put a disclaimer like the one in the earlier section, but Mike King cross-referenced this against other leaks that confirm as much. Plausible deniability seems low.</p>
				<h3 id="manual-review" tabindex="-1">Manual review</h3>
				<figure itemprop="mentions" itemscope="" itemtype="https://schema.org/Quotation">
					<blockquote itemprop="text">
						<p><code>golden</code> (type: <code>boolean()</code>, default: <code>nil</code>) - Flag for indicating that the document is a gold-standard document. This can be used for putting additional weight on human-labeled documents in contrast to automatically labeled annotations.</p>
					</blockquote>
					<figcaption><span class="h-cite" itemprop="citation" role="doc-credit"><span itemprop="isPartOf" itemscope="" itemtype="https://schema.org/APIReference"><cite itemprop="name headline" class="p-name"><a class="u-url" itemprop="url" href="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.NlpSaftDocument.html#module-attributes">Google Content Warehouse API documentation: GoogleApi.ContentWarehouse.V1.Model.NlpSaftDocument</a></cite></span></span>
</figcaption>
				</figure>
				<p>The existence of manual review to evaluate Google’s ranking has never been secret, but evidence that manually reviewed documents can have a ranking adjustment is new.</p>
				<p>Manual review can combine with modifications to a ranking algorithm to bias its centrality algorithm toward handpicked pages, which is how Marginalia achieves its anti-SEO bias.<sup><a href="#fn:5" id="fnref:5" role="doc-noteref">note 5</a></sup> Personalized PageRank is one such algorithm, documented in the original PageRank paper. I like the use of manual review for “gold-standard documents” when applied to centrality algorithm biasing. However, I don’t know how I feel about manual reviewer scores directly appearing as a result ranking factor.</p>
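				<p>To make the idea of biasing a centrality algorithm around handpicked pages concrete, here is a minimal sketch of Personalized PageRank by power iteration. It is not Google’s or Marginalia’s implementation. The function name, the toy link graph, the damping factor of 0.85, and the iteration count are all assumptions chosen for illustration.</p>

```python
def personalized_pagerank(links, seeds, damping=0.85, iterations=50):
    """Power iteration in which random jumps land only on the seed pages.

    links: dict mapping each page to the list of pages it links to.
    seeds: non-empty set of handpicked pages that receive all teleportation.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    # Classic PageRank teleports uniformly; the personalized variant
    # teleports only to the handpicked seeds.
    teleport = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: hand its rank back to the seeds.
                for p in pages:
                    new_rank[p] += damping * rank[page] * teleport[p]
        rank = new_rank
    return rank


# Hypothetical graph: a handpicked page, two pages reachable from it,
# and a pair of pages that only link to each other.
toy_links = {
    "handpicked": ["blog"],
    "blog": ["handpicked", "forum"],
    "forum": ["blog"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
toy_rank = personalized_pagerank(toy_links, {"handpicked"})
for name in sorted(toy_rank, key=toy_rank.get, reverse=True):
    print(name, round(toy_rank[name], 4))
```

				<p>In this toy graph, the two pages that only link to each other never receive rank, because nothing reachable from the handpicked seed links to them. That is the sense in which a manually chosen seed set can bias results without a reviewer scoring each page directly.</p>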
				<p>Like font size, we don’t know whether Google actually uses manual review in production ranking factors. Google might catalog it here to run tests of expected-versus-actual ranking. Directly or indirectly, it shows that Google does take manual reviews into account in some way.</p>
				<h3 id="bias-against-new-sites" tabindex="-1">Bias against new sites</h3>
				<p>It’s not just you. Google has a bias against new sites due to their spam potential.<sup><a href="#fn:6" id="fnref:6" role="doc-noteref">note 6</a></sup> Contrary to what official statements say, Google has a “sandbox” for new sites. Google also uses domain registration information.<sup><a href="#fn:7" id="fnref:7" role="doc-noteref">note 7</a></sup> Mike King’s post says this comes from Google Domains itself, but I haven’t found evidence to back this up. Current domain registration records are public. An organization such as Google can use them to build a catalog of historical registration information without tapping into its own domain registrar. Anybody with <code>whois</code> can do this!</p>
				<h3 id="truncation" tabindex="-1">Truncation</h3>
				<p>Google does truncate pages to a certain number of tokens,<sup><a href="#fn:8" id="fnref:8" role="doc-noteref">note 8</a></sup> like most engines, instead of reading long pages indefinitely. I find this strange: based on keyword matches, I’m sure Google has read to the end of some of my longest blog posts. Some fill almost 100 pages printed out (yeah…I have a problem). Google keeps only a limited number of historic versions of pages,<sup><a href="#fn:9" id="fnref:9" role="doc-noteref">note 9</a></sup> so older versions of my posts don’t explain those matches. Perhaps the token limit is just that high.</p>
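				<p>The reasoning above can be sketched in a few lines: if an indexer truncates at a token cap, a term that appears only at the end of a long page is findable only when the cap exceeds the page length. The whitespace tokenizer, the function names, and the cap values below are assumptions for illustration. The leaked docs say neither how Google tokenizes nor what its cap is.</p>

```python
def truncate_tokens(text, max_tokens):
    """Split on whitespace and keep only the first max_tokens tokens."""
    tokens = text.split()
    kept = tokens[:max_tokens]
    dropped = len(tokens) - len(kept)
    return kept, dropped


def survives_truncation(text, keyword, max_tokens):
    """Report whether a keyword is still present after truncation."""
    kept, _ = truncate_tokens(text, max_tokens)
    return keyword in kept


# A hypothetical long post whose distinctive term appears only at the very end.
long_post = " ".join(["filler"] * 5000 + ["zyzzyva"])
print(survives_truncation(long_post, "zyzzyva", 1000))   # cap below page length
print(survives_truncation(long_post, "zyzzyva", 10000))  # cap above page length
```

				<p>A match on a term like this one only tells you that the cap, whatever it is, sits above the position of that term.</p>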
				<h3 id="author-name-mismatches" tabindex="-1">Author name mismatches</h3>
				<p>Google extracts the same piece of metadata (e.g., published/updated timestamps or author names) from wherever it exists (the URL, byline, natural-language processing, structured data, the sitemap, etc.). For authors, it does seem to care about mismatches. Public documentation allows an author entity to have many names, and this factor doesn’t necessarily contradict that. I imagine that ensuring author name consistency could create bias against people who do specify different authors in different parts of the same page (plural systems come to mind), especially when we consider false positives. I’m uncertain; this is speculation on my part.</p>
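				<p>As a rough illustration of how a consistency check could produce false positives, the sketch below compares author names gathered from several places on one page. It is speculative, like the paragraph above: the source labels, function names, and normalization rules are assumptions, not anything the leaked docs describe.</p>

```python
import unicodedata


def normalize_name(name):
    """Fold Unicode form, case, and whitespace so trivial variants compare equal."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(folded.split())


def group_author_names(names_by_source):
    """Map each normalized name to the list of sources that reported it."""
    groups = {}
    for source, name in names_by_source.items():
        groups.setdefault(normalize_name(name), []).append(source)
    return groups


def has_mismatch(names_by_source):
    """True when the sources disagree even after normalization."""
    return len(group_author_names(names_by_source)) > 1


# Hypothetical page whose sources differ only in case and spacing.
consistent = {
    "byline": "Example Author",
    "structured_data": "example  author",
    "feed": "EXAMPLE AUTHOR",
}
# Hypothetical page where one author legitimately uses two names.
two_names = {
    "byline": "Example Author",
    "structured_data": "E. Author",
}
print(has_mismatch(consistent))
print(has_mismatch(two_names))
```

				<p>The second page gets flagged even though nothing about it is wrong, which is the kind of false positive I worry about.</p>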
				<h2 id="a-cold-shower-this-isnt-as-significant-as-some-seos-claim" tabindex="-1">A cold shower: this isn’t as significant as some SEOs claim</h2>
				<aside role="none">
					<a href="#a-cold-shower-this-isnt-as-significant-as-some-seos-claim" aria-labelledby="a-cold-shower-this-isnt-as-significant-as-some-seos-claim-prefix a-cold-shower-this-isnt-as-significant-as-some-seos-claim">
						<span id="a-cold-shower-this-isnt-as-significant-as-some-seos-claim-prefix">Permalink to section</span>
					</a>
				</aside>
				<p>We only have API documentation. We don’t know about any hidden knowledge, whether any of these factors have a ranking weight of “zero”, whether any of these conditionally apply, which are only used internally for testing, etc. As I said in prior disclaimers, some factors might exist for testing purposes. Serious conclusions drawn from this leak are, to some degree, speculation.</p>
				<p>I wouldn’t panic over how SEO companies use this leak to game the algorithm and ruin search further. Given their track record of missing the forest for the trees and the ever-changing hidden weighting factors we can’t see, we have little reason for concern. I imagine certain people in the SEO industry jumping to conclusions based on word choice in these API docs, not realizing that a term’s legacy meaning can differ from its current one.</p>
				<p>For example, per-page metadata includes integer attributes such as <code>crawlerPageRank</code> and <code>pagerank2</code>, but PageRank is no longer a useful way to build a ranking algorithm for the entire Web. The attribute might no longer carry weight, or the decades-old PageRank centrality algorithm might not populate this anymore. To put this in perspective, the docs mention a <code>HtmlrenderWebkitHeadlessProto</code> but Google’s known to use a Chromium-based browser to render pages. Chromium hasn’t used WebKit in a decade; it hard-forked WebKit to make Blink in 2013.</p>
				<p>Per-page metadata also includes a <code>toolbarPagerank</code> integer attribute that hearkens back to the ancient Toolbar PageRank; this also probably doesn’t carry weight today. You can read more about Google’s use of PageRank and Toolbar in <span class="h-cite" itemprop="mentions" itemscope="" itemtype="https://schema.org/NewsArticle"><cite itemprop="name headline" class="p-name"><a class="u-url" itemprop="url" href="https://searchengineland.com/rip-google-pagerank-retrospective-244286">RIP Google PageRank score: A retrospective on how it ruined the web</a></cite> by <span itemprop="author" itemscope="" itemtype="https://schema.org/Person" class="h-card vcard p-author"><a itemprop="url" href="https://dannysullivan.com/" class="u-url url"><span itemprop="name" class="p-name fn n"><span itemprop="givenName" class="p-given-name given-name">Danny</span>&#160;<span itemprop="familyName" class="p-family-name family-name">Sullivan</span></span></a></span></span>.</p>
				<section role="doc-conclusion">
					<h2 id="conclusion-my-takeaways" tabindex="-1">Conclusion: my takeaways</h2>
					<aside role="none">
						<a href="#conclusion-my-takeaways" aria-labelledby="conclusion-my-takeaways-prefix conclusion-my-takeaways">
							<span id="conclusion-my-takeaways-prefix">Permalink to section</span>
						</a>
					</aside>
					<p>I still despise how the SEO industry and Google have started an arms race to incentivize making websites worse for actual users, selecting against small independent websites. I do maintain that we can carve out a non-toxic sliver of SEO: “search engine compatibility”. Few features belong uniquely to compatibility with search engines, browsers, reading modes, feed readers, social media link previews, and so on. If you specifically ignore search engine compatibility but target everything else, you’ll implement it regardless. <a href="/notes/2022/06/23/agent-optimization/">I call this principle <dfn>“agent optimization”</dfn></a>. I prefer the idea of optimizing for generic agents to optimizing for search engines, let alone one (1) search engine, in isolation. Naturally, user-agents (including browsers) come first; nothing should have significant conflict with them.</p>
					<p>If you came to this article as an SEO, I don’t think I can convince you to stop. Instead, remember that it’s easy to miss the forest for the trees. Don’t lose sleep over <em>one in fourteen thousand ranking criteria</em> without other data backing up its importance and current relevance.</p>
					<p>Consider my rule of thumb, whose relevance will outlast this leak: assume Google looks at whatever information it can <em>if it helps Google draw the conclusions its public guidelines say it tries to draw,</em> even if those guidelines say it doesn’t use that information. The information Google uses differs from what it tells the public (yes, Google lied), and changes with time; however, Google’s intent makes for less of a moving target. This leak might contradict <em>how</em> Google determines what it should rank well, but not <em>what</em> it looks for. A good reference for what Google looks for is <a href="https://static.googleusercontent.com/media/guidelines.raterhub.com/en//searchqualityevaluatorguidelines.pdf">Google’s search rater guidelines</a> for manual reviewers.</p>
					<p>Google lied, but don’t uncritically fall for the coming SEO hype.</p>
				</section>
				<hr/>
				<section role="doc-endnotes" aria-labelledby="note-hd">
					<h2 id="note-hd">Footnotes</h2>
					<ol>
						<li id="fn:1" tabindex="-1">
							<p>See <a href="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.QualityNsrNsrData.html">the <code>smallPersonalSite</code> attribute of <code>QualityNsrNsrData</code></a>. </p>
							<a href="#fnref:1" aria-labelledby="bl1-1 bl2-1" role="doc-backlink">
								<span id="bl1-1">Back</span>
								<span id="bl2-1" hidden=""> to reference 1</span>
							</a>
						</li>
						<li id="fn:2" tabindex="-1">
							<p>See the conclusion, or snippets of the Google Search Central documentation <a href="https://developers.google.com/search/docs/fundamentals/creating-helpful-content">such as this page describing the <abbr>EEAT</abbr> principle: experience, expertise, authoritativeness, and trustworthiness</a>. </p>
							<a href="#fnref:2" aria-labelledby="bl1-2 bl2-2" role="doc-backlink">
								<span id="bl1-2">Back</span>
								<span id="bl2-2" hidden=""> to reference 2</span>
							</a>
						</li>
						<li id="fn:3" tabindex="-1">
							<p><a href="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.AnchorsAnchor.html#module-attributes"><code>AnchorsAnchor</code> has a <code>fontSize</code> member</a> with no extra documentation. </p>
							<a href="#fnref:3" aria-labelledby="bl1-3 bl2-3" role="doc-backlink">
								<span id="bl1-3">Back</span>
								<span id="bl2-3" hidden=""> to reference 3</span>
							</a>
						</li>
						<li id="fn:4" tabindex="-1">
							<p>I’d always assumed (in private, due to a lack of evidence) that the Chrome User Experience Report (<abbr>CrUX</abbr>) played a role in search rankings. I don’t know if or how this data overlaps with CrUX. </p>
							<a href="#fnref:4" aria-labelledby="bl1-4 bl2-4" role="doc-backlink">
								<span id="bl1-4">Back</span>
								<span id="bl2-4" hidden=""> to reference 4</span>
							</a>
						</li>
						<li id="fn:5" tabindex="-1">
							<p><a href="https://www.marginalia.nu/log/26-personalized-pagerank/">The creator of Marginalia documents initial experiments in a 2021 blog post</a>, and later <a href="https://news.ycombinator.com/item?id=32349094">confirmed this on “Hacker” “News”</a>. In 2023, <a href="https://www.marginalia.nu/log/73-new-approach-to-ranking/">Marginalia switched away from PageRank to a different centrality algorithm</a>. </p>
							<a href="#fnref:5" aria-labelledby="bl1-5 bl2-5" role="doc-backlink">
								<span id="bl1-5">Back</span>
								<span id="bl2-5" hidden=""> to reference 5</span>
							</a>
						</li>
						<li id="fn:6" tabindex="-1">
							<p>See <a href="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.PerDocData.html#module-attributes">the <code>hostAge</code> attribute of <code>PerDocData</code></a>. </p>
							<a href="#fnref:6" aria-labelledby="bl1-6 bl2-6" role="doc-backlink">
								<span id="bl1-6">Back</span>
								<span id="bl2-6" hidden=""> to reference 6</span>
							</a>
						</li>
						<li id="fn:7" tabindex="-1">
							<p>See <a href="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.RegistrationInfo.html"><code>RegistrationInfo</code></a>. It defines <code>createdDate</code> and <code>expiredDate</code> attributes. </p>
							<a href="#fnref:7" aria-labelledby="bl1-7 bl2-7" role="doc-backlink">
								<span id="bl1-7">Back</span>
								<span id="bl2-7" hidden=""> to reference 7</span>
							</a>
						</li>
						<li id="fn:8" tabindex="-1">
							<p>See docs for <a href="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.DocProperties.html#module-attributes"><code>numTokens</code> in <code>DocProperties</code></a>: <q cite="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.DocProperties.html#module-attributes">we drop some tokens in mustang and also truncate docs at a max cap</q>. </p>
							<a href="#fnref:8" aria-labelledby="bl1-8 bl2-8" role="doc-backlink">
								<span id="bl1-8">Back</span>
								<span id="bl2-8" hidden=""> to reference 8</span>
							</a>
						</li>
						<li id="fn:9" tabindex="-1">
							<p>See <a href="https://hexdocs.pm/google_api_content_warehouse/0.4.0/GoogleApi.ContentWarehouse.V1.Model.CompositeDocIndexingInfo.html#module-attributes">the <code>urlHistory</code> attribute of <code>CompositeDocIndexingInfo</code></a>. </p>
							<a href="#fnref:9" aria-labelledby="bl1-9 bl2-9" role="doc-backlink">
								<span id="bl1-9">Back</span>
								<span id="bl2-9" hidden=""> to reference 9</span>
							</a>
						</li>
					</ol>
				</section>
			</div>
			<hr/>
			<footer>
				<h2 id="interact" tabindex="-1">Interact</h2>
				<p>You can interact by <a href="#webmentions">sending webmentions</a> or by visiting a syndicated copy of this post.</p>
				<h3>Syndication</h3>
				<p>This post has been syndicated to:</p>
				<ul>
					<li>
						<a itemprop="discussionUrl" class="u-syndication" rel="syndication" href="https://pleroma.envs.net/notice/AiPaH6gN6VboAEbG5I">The Fediverse</a>
					</li>
					<li>
						<a itemprop="discussionUrl" class="u-syndication" rel="syndication" href="https://community.mojeek.com/t/takeaways-from-the-google-cloud-warehouse-api-documentation-leak/1079">The Mojeek Discourse</a>
					</li>
					<li>
						<a itemprop="discussionUrl" class="u-syndication" rel="syndication" href="https://www.jstpst.net/f/just_post/10039/takeaways-from-the-google-cloud-warehouse-api-documentation">jstpst</a>
					</li>
				</ul>
				<h3 id="webmentions" tabindex="-1">Web­mentions</h3>
				<p>This site supports <a href="https://indieweb.org/Webmention">Webmentions</a>, a backlink-based alternative to traditional comment forms.</p>
				<details>
					<summary>Send a Webmention</summary>
					<fieldset>
						<legend>Publish a response on your own website, and link back to this page's canonical location. Then share your link here to turn it into a Webmention.</legend>
						<form itemprop="potentialAction" itemscope="" itemtype="https://schema.org/CommentAction" action="https://collector.seirdy.one/webmentions/receive" method="post">
							<meta itemprop="actionStatus" content="PotentialActionStatus"/>
							<input type="hidden" name="target" value="https://seirdy.one/posts/2024/05/30/google-document-warehouse-api-docs-leak/"/>
							<label for="menchie">URL of page linking here</label>
							<div>
								<input id="menchie" type="url" autocomplete="on" required="" name="source"/>
								<div>
									<input type="submit" value="submit"/>
								</div>
							</div>
						</form>
					</fieldset>
				</details>
				<details>
					<summary>Toggle 1 Webmention</summary>
					<p>Webmentions received for this post appear in the following list after I approve them. I sometimes send Webmentions to myself on behalf of linking sites that don’t support them. I auto-replace broken links with <a href="https://web.archive.org/">Wayback Machine</a> snapshots, if they exist.</p>
					<dl>
						<div itemprop="comment" itemscope="" itemtype="https://schema.org/Comment" class="u-comment h-cite">
							<dt>
								<time class="dt-published" itemprop="datePublished" datetime="2024-08-02 04:05:13Z">
					2024-08-02
				</time>
							</dt>
							<dd><a class="u-url" itemprop="url" href="https://brid.gy/comment/mastodon/@seirdy@pleroma.envs.net/AiPaH6gN6VboAEbG5I/AiPfTjjAGEH8VQkR3A" rel="nofollow ugc"><span itemprop="name" class="p-name">
							wow i sure hope nobody notices i got the title wrong and edited it after posting</span></a>
					by <span itemprop="author" itemscope="" itemtype="https://schema.org/Person" class="h-card p-author vcard"><span itemprop="name" class="p-name fn n">Seirdy</span></span><p role="doc-tip" itemprop="accessibilitySummary">This comment may have major formatting errors that could impact screen reader comprehension.</p><p><q itemprop="text" class="p-content">wow i sure hope nobody notices i got the title wrong and edited it after posting</q></p>
			</dd>
						</div>
					</dl>
				</details>
				<p>Feel free to contact me directly with feedback; <a href="https://seirdy.one/about/#location-seirdy-online">here’s my contact info</a>.</p>
			</footer>
		</article>
	</main>
	<hr/>
	<aside aria-labelledby="continue-hd">
		<nav aria-labelledby="continue-hd">
			<h2 id="continue-hd">Continue reading</h2>
			<ul>
				<li>Previous post: <a href="https://seirdy.one/posts/2024/04/04/mdn-ai-help-and-lucid-lies/" rel="prev">MDN’s AI Help and lucid lies</a></li>
				<li>Next post: <a href="https://seirdy.one/posts/2024/09/25/post-ocsp-revocation/" rel="next">Post-OCSP certificate revocation in the Web PKI</a></li>
			</ul>
		</nav>
		<p>This place is not a place of honor. Opinions are those of your employer.</p>
		<p>For more information, please re-read.</p>
	</aside>
	<hr/>
	<footer>
		<nav aria-labelledby="bc-label" itemscope="" itemprop="breadcrumb" itemtype="https://schema.org/BreadcrumbList">
			<span id="bc-label">You are here: </span>
			<ol>
				<li itemscope="" itemprop="itemListElement" itemtype="https://schema.org/ListItem">
					<a itemprop="item" href="https://seirdy.one/posts/">
						<span itemprop="name">Articles</span>
					</a>
					<meta itemprop="position" content="1"/>
				</li>
				<li itemscope="" itemprop="itemListElement" itemtype="https://schema.org/ListItem">
					<a aria-current="page" itemprop="item" href="https://seirdy.one/posts/2024/05/30/google-document-warehouse-api-docs-leak/">
						<span itemprop="name">Takeaways from the Google Content Warehouse API documentation leak</span>
					</a>
					<meta itemprop="position" content="2"/>
				</li>
			</ol>
		</nav>
		<hr/>
		<p>
Copyright <time itemprop="copyrightYear" datetime="2024">2024</time> <span itemprop="author copyrightHolder" itemscope="" itemtype="https://schema.org/Person" itemid="https://seirdy.one/#seirdy" class="h-card p-author author vcard"><a itemprop="url" href="https://seirdy.one/" rel="author me home cc:attributionURL" class="u-url u-uid url" property="cc:attributionName"><img itemprop="image" width="16" height="16" alt="" src="https://seirdy.one/favicon.1250396055.png" class="u-photo photo"/> <span itemprop="name" class="p-name p-nickname nickname fn">Seirdy</span></a></span></p>
		<nav aria-label="site info">
			<ul>
				<li itemprop="license" itemscope="" itemtype="https://schema.org/CreativeWork">
					<a rel="license" itemprop="url" href="https://creativecommons.org/licenses/by-sa/4.0/">
						<span itemprop="name">CC BY-SA 4.0</span>
					</a>
				</li>
				<li>
					<a rel="source" href="https://sr.ht/~seirdy/seirdy.one/">Source code</a>
				</li>
				<li>
					<a rel="alternate" href="http://wgq3bd2kqoybhstp77i3wrzbfnsyd27wt34psaja4grqiezqircorkyd.onion/posts/2024/05/30/google-document-warehouse-api-docs-leak/">Tor</a>
				</li>
				<li>
					<a href="https://seirdy.one/meta/privacy/" rel="privacy-policy">Privacy</a>
				</li>
				<li>
					<a href="https://seirdy.one/meta/site-design/">Site design</a>
				</li>
			</ul>
		</nav>
		<hr/>
		<p>
			<a href="https://seirdy.one/meta/badges/">
				<img src="/p/b/sticker_88x31.3319174455.png" alt="88-by-31 button: my favicon, a white colon and semicolon on a black background, next to the word Seirdy." width="88" height="31" class="pix"/>
			</a>
		</p>
	</footer>
</body>
</html>