
After submitting the sitemap | Why does Google only index some pages?

Author: Don jiang

When webmasters submit a sitemap through Google Search Console and find that the number of indexed pages is far below the expected page count, they often fall into the misconception that submitting more frequently or repeatedly modifying the file will help.

According to 2023 official data, over 67% of indexing issues stem from three major categories: sitemap configuration errors, crawl path blocking, and page quality defects.

Why Google only indexes some pages after submitting sitemap

Vulnerabilities in the sitemap file itself

When submitted sitemaps are not completely crawled by Google, 50% of cases stem from technical defects in the file itself.

We once detected an e-commerce platform’s submitted sitemap.xml and found that unfiltered dynamic parameters in product page URLs caused 27,000 duplicate links to pollute the file, directly resulting in Google only indexing the homepage.

▍ Vulnerability 1: Format errors causing parsing interruption

Data source: Ahrefs 2023 Site Audit Report

Typical case: A medical website’s sitemap used Windows-1252 encoding, causing Google to fail parsing 3,200 pages and only recognize the homepage (Search Console showed “cannot read” warning)

High-frequency error points:

✅ XML tags not closed (accounting for 43% of format errors)
✅ Special symbols not escaped (such as using the & symbol directly instead of replacing it with &amp;)
✅ Missing xmlns namespace declaration (missing <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">)

Emergency solution:

  • Use a sitemap validator to check the required element hierarchy
  • Install XML Tools plugin in VSCode for real-time syntax validation
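
As a rough complement to those tools, the three high-frequency errors listed above can be caught with Python's standard library. This is a minimal sketch, not a full validator; the function name `check_sitemap` is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def check_sitemap(xml_bytes: bytes) -> list[str]:
    """Return a list of problems found; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        # Unclosed tags and unescaped '&' both surface here as parse errors.
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        problems.append("root element is not <urlset> in the sitemaps.org namespace")
    locs = root.findall(f"{{{SITEMAP_NS}}}url/{{{SITEMAP_NS}}}loc")
    if not locs:
        problems.append("no <url><loc> entries found")
    return problems
```

Passing the raw bytes (rather than a decoded string) lets the parser honor the encoding declared in the XML prolog.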

▍ Vulnerability 2: Dead links trigger trust crisis

Industry research: Screaming Frog crawling 500,000 sites data statistics

Striking data:

✖️ Average of 4.7% 404/410 error links per sitemap
✖️ Sitemaps with over 5% dead links see a 62% decrease in indexing rate

Real incident: A travel platform’s sitemap included discontinued product pages (returning 302 redirect to homepage), causing Google to judge deliberate index manipulation, with core content pages’ indexing delayed by 117 days

Clearing steps:

  1. Use crawler tool to set User-Agent as “Googlebot” to simulate crawling all URLs in sitemap
  2. Export links with non-200 status codes, then either batch add <meta name="robots" content="noindex"> to pages that should stay live or remove the links from the sitemap
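
The clearing steps above could be scripted roughly as follows. This is a sketch under assumptions: `fetch_status` and `non_200_urls` are hypothetical helper names, redirects are deliberately not followed so that 301/302 entries are reported, and network failures (`URLError`) are left unhandled for brevity.

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so 3xx codes stay visible."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # the 3xx response is then raised as HTTPError

def fetch_status(url: str, user_agent: str = "Googlebot") -> int:
    """Return the HTTP status code for url, using a HEAD request."""
    opener = urllib.request.build_opener(NoRedirect)
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": user_agent})
    try:
        with opener.open(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code

def non_200_urls(urls, fetch=fetch_status):
    """Map each URL whose status is not 200 to that status."""
    statuses = {url: fetch(url) for url in urls}
    return {url: code for url, code in statuses.items() if code != 200}
```

The `fetch` parameter is injectable so the filtering logic can be checked offline with a stub before pointing it at a live site.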

▍ Vulnerability 3: File size exceeding limit causing truncation

Google official warning threshold:

⚠️ Google stops processing a single sitemap that exceeds 50MB (uncompressed) or 50,000 URLs

Disaster case: A news site’s sitemap.xml was not split and contained 82,000 article links; Google actually processed only the first 48,572 (verified through server log analysis)

Splitting strategy:
🔹 Split by content type: /sitemap-articles.xml, /sitemap-products.xml
🔹 Split by date: /sitemap-2023-08.xml (suitable for high-frequency update sites)

Capacity monitoring:

Run a Python script weekly to count the <url> entries in sitemap.xml (a raw line count such as wc -l sitemap.xml only approximates this), and trigger a split warning when the file reaches 45,000 entries
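
A minimal sketch of that weekly check, plus a chunking helper for the split itself, might look like this. The function names are illustrative assumptions; the 45,000 and 50,000 figures come from the text above.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
WARN_AT = 45_000   # warning threshold from the text
MAX_URLS = 50_000  # per-file URL limit from the text

def sitemap_urls(xml_bytes: bytes) -> list[str]:
    """Return every <loc> value listed in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.iter(f"{NS}loc") if loc.text]

def needs_split(urls: list[str]) -> bool:
    """True once the URL count reaches the warning threshold."""
    return len(urls) >= WARN_AT

def split_urls(urls: list[str], size: int = MAX_URLS) -> list[list[str]]:
    """Chunk the URL list so each output sitemap stays within the limit."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```

Counting parsed entries rather than lines keeps the check accurate even when the file is minified onto a single line.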

▍ Vulnerability 4: Update frequency deceiving search engines

Crawler countermeasure mechanism:

🚫 Sites abusing <lastmod> field (such as all pages marked with current date) see a 40% reduction in indexing speed

Lesson: A forum reset every lastmod value in its sitemap to the current time each day, and after 3 weeks, index coverage plummeted from 89% to 17%

Compliant operation:

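✅ Only modify <lastmod> for truly updated pages (precise to the second, in W3C datetime format: 2023-08-20T15:03:22+00:00)
✅ Set historical pages to <changefreq>monthly</changefreq> to reduce crawl pressure (Google states that it ignores <changefreq>, so this mainly helps other crawlers)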
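
One rough way to spot the "every page shares today's date" pattern before submitting is to measure how concentrated the <lastmod> values are. This is only a heuristic sketch; the function name and any cut-off you apply to its result are assumptions, not a rule published by Google.

```python
from collections import Counter

def lastmod_concentration(lastmods: list[str]) -> float:
    """Share of entries whose <lastmod> falls on the single most common date."""
    if not lastmods:
        return 0.0
    # Keep only the YYYY-MM-DD prefix so timestamps on the same day group together.
    dates = Counter(value[:10] for value in lastmods)
    return dates.most_common(1)[0][1] / len(lastmods)
```

A value close to 1.0 on a large, mostly static site suggests the field is being regenerated wholesale rather than tracking real edits.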

Website structure blocking crawl channels

Even with a perfect sitemap, website structure can still become a “maze” for Google crawlers.

Pages using React framework without pre-rendering configuration result in 60% of content being judged as “blank pages” by Google.

When internal link weight distribution is unbalanced (such as homepage stacking 150+ external links), crawler crawl depth will be limited to within 2 layers, causing deep product pages to permanently stay outside the index database.

▍ Robots.txt misblocking core pages

Typical scenarios:

  • WordPress site default rule Disallow: /wp-admin/ causes associated article paths to be blocked (such as /wp-admin/post.php?post=123 being misjudged)
  • When using Shopify for site building, the backend automatically generates Disallow: /a/ intercepting the member center pages

Data impact:

✖️ 19% of websites lost over 30% of indexing due to robots.txt configuration errors
✖️ When Google crawler encounters Disallow rule, it takes an average of 14 days to re-probe the path

Fix solution:

  1. Use robots.txt testing tool to verify rule impact scope
  2. Avoid blocking URLs that contain dynamic parameters such as ?ref= (unless the pages are confirmed to have no content)
  3. For already mistakenly blocked pages, after releasing restrictions in robots.txt, actively submit for recrawl through URL Inspection tool
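
Before relying on a manual check, the sitemap URLs can be run against the live rules with Python's built-in robots.txt parser. This is a sketch; `blocked_urls` is a hypothetical helper, and the standard-library parser uses simple prefix matching, so it does not reproduce Google's handling of wildcards such as * and $.

```python
from urllib.robotparser import RobotFileParser

def blocked_urls(robots_txt: str, urls: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the URLs that the given robots.txt disallows for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]
```

Any sitemap URL that shows up in the result is being submitted and blocked at the same time, which is worth resolving one way or the other.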

▍ JS rendering causing content vacuum

Framework risk value:

  • React/Vue single page applications (SPA): when not pre-rendered, Google can only crawl 23% of DOM elements
  • Lazy loaded images: 51% probability on mobile that loading mechanism cannot be triggered

Real case:

An e-commerce platform’s product detail pages used Vue to dynamically render prices and specifications, causing the average content length of pages indexed by Google to be only 87 characters (normal should be 1200+ characters), with conversion rate directly dropping 64%

First aid measures:

  1. Use Mobile-friendly testing tool to detect rendering completeness
  2. Implement server-side rendering (SSR) for SEO core pages, or use Prerender.io to generate static snapshots
  3. Place key text content in <noscript> tags (at least include H1 + 3 lines of description)
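
To get a quick feel for how much content exists before any JavaScript runs, the raw HTML response can be reduced to its visible text and measured. This is a minimal sketch with assumed names (`TextExtractor`, `static_text_length`); it is not a substitute for a real rendering test.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def static_text_length(html: str) -> int:
    """Length of the visible text in the un-rendered HTML, whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return len(" ".join(" ".join(parser.chunks).split()))
```

A product page that returns only a few dozen characters here, as in the 87-character case above, is relying almost entirely on client-side rendering.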

▍ Internal link weight distribution imbalance

Crawl depth threshold:

  • Homepage outbound links > 150: crawler average crawl depth drops to 2.1 layers
  • Core content click depth > 3 layers: indexing probability drops to 38%

Structure optimization strategy:

✅ Breadcrumb navigation must include category hierarchy (such as Homepage > Electronics > Phones > Huawei P60)
✅ Add “Important pages” module on list pages to manually boost internal link weight for target pages
✅ Use Screaming Frog to filter out orphan pages (Orphan Pages) with zero inbound links, then link to them from the bottom of related articles
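
Click depth and orphan pages can also be computed directly from a crawl export. The sketch below assumes the export has been loaded into a dict mapping each page to the pages it links to; the function names are illustrative.

```python
from collections import deque

def click_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Breadth-first search from the homepage; depth = minimum number of clicks."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

def orphan_pages(links: dict[str, list[str]], home: str) -> set[str]:
    """Known pages (other than the homepage) that no page links to."""
    linked = {target for targets in links.values() for target in targets}
    return set(links) - linked - {home}
```

Pages whose depth exceeds 3, or that appear in the orphan set, are the ones to surface through breadcrumbs or an "Important pages" module.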

▍ Pagination/canonical tag abuse

Suicidal operations:

  • Product pagination using rel="canonical" pointing to homepage: causes 63% of pages to be merged and deleted
  • Article comment pagination without adding rel="next"/"prev": causes main content page weight to be diluted

Page content quality triggering filtering

Google 2023 algorithm report confirms: 61% of low-indexing pages die from content quality traps.

When template pages with over 32% similarity proliferate, indexing rate plummets to 41%; pages with mobile load time exceeding 2.5 seconds have their crawl priority directly downgraded.

▍ Duplicate content causing trust collapse

Industry blacklist threshold:

  • Pages generated from same template (such as product pagination) with similarity > 32%: indexing rate drops to 41%
  • Content overlap detection: Copyscape results suggest merged indexing is triggered when paragraph duplication exceeds 15%

Case:

A garment wholesale site used the same description to generate 5,200 product pages, and Google only indexed the homepage (Search Console prompted “duplicate page” warning), organic traffic dropped 89% in one week

Root cause solution:

  1. Use Python’s difflib library to calculate page similarity, batch remove pages with duplication rate > 25%
  2. For necessarily similar pages (such as city sub-sites), add precise <meta name="description"> differentiated descriptions
  3. Add rel="canonical" on duplicate pages pointing to main version, such as:
html
<!-- Placed on the duplicate variant, e.g. /product-a?color=red, pointing to the main version -->
<link rel="canonical" href="https://example.com/product-a" />
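
The difflib comparison from step 1 could be sketched like this. The helper names are assumptions, and how the article's 25% duplication rate maps onto a `SequenceMatcher` ratio is left to the caller, since the two measures are not necessarily the same thing.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Similarity ratio between two page texts, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()

def near_duplicates(pages: dict[str, str], threshold: float) -> list[tuple[str, str, float]]:
    """Return (url_a, url_b, ratio) for every pair at or above threshold."""
    results = []
    for (url_a, text_a), (url_b, text_b) in combinations(pages.items(), 2):
        ratio = similarity(text_a, text_b)
        if ratio >= threshold:
            results.append((url_a, url_b, ratio))
    return results
```

Pairwise comparison grows quadratically with page count, so for thousands of product pages it is more practical to compare within one template or category at a time.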

▍ Loading performance breaking tolerance threshold

Core Web Vitals death line:

  • Mobile LCP (Largest Contentful Paint) > 2.5s → crawl priority downgraded
  • CLS (Cumulative Layout Shift) > 0.25 → indexing delay increases by 3 times

Lesson:

A news site failed to compress above-the-fold images (averaging 4.7MB), causing mobile LCP (Largest Contentful Paint) to reach 8.3s, with 12,000 articles marked by Google as “low-value content”

Speed optimization checklist:

✅ Use WebP format instead of PNG/JPG, use Squoosh to batch compress to ≤150KB
✅ Inline critical CSS for above-the-fold, async non-critical JS (add async or defer attributes)
✅ Self-host third-party scripts locally to reduce external requests (such as moving Google Analytics to GTM hosting)

▍ Missing structured data causing priority downgrade

Crawler crawl weight rules:

  • Pages with FAQ Schema → average indexing speed increases by 37%
  • No structured markup → index queue wait time extends to 14 days

Case:

A medical site added medical Schema markup for disease details on article pages; index coverage jumped from 55% to 92%, and long-tail keyword rankings increased by 300%

Practical code:

html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "How to improve Google indexing?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Sitemap structure and page loading speed can be optimized"
    }
  }]
}
</script>
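
Hand-written JSON-LD is easy to break with a missing quote, so generating it from data is often safer. The following is a sketch; `faq_jsonld` is a hypothetical helper that builds the same FAQPage structure as the markup above.

```python
import json

def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build a FAQPage JSON-LD <script> element from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    # json.dumps handles quoting; escaping "</" keeps the text from
    # closing the surrounding <script> element early.
    body = json.dumps(data, ensure_ascii=False).replace("</", "<\\/")
    return f'<script type="application/ld+json">{body}</script>'
```

Because the output always passes through `json.dumps`, quotes and special characters in questions or answers cannot produce invalid markup.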

Server configuration dragging down crawl efficiency

▍ Crawl-delay parameter abuse

Google crawler countermeasure mechanism:

  • When setting Crawl-delay: 10 → daily maximum crawl volume drops sharply from 5,000 pages to 288 pages
  • Under default unlimited state → Googlebot crawls an average of 0.8 pages per second (automatically adjusted based on server load)

Real case:

A forum set Crawl-delay: 5 in robots.txt to prevent server overload, causing Google’s monthly crawl volume to plummet from 820,000 to 43,000, with new content indexing delay reaching 23 days

Fix strategy:

  1. Delete Crawl-delay declaration (Google explicitly ignores this parameter)
  2. Use dedicated crawler restrictions like Googlebot-News instead (only speed limit for specific crawlers)
  3. Configure intelligent rate limiting in Nginx:
nginx
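# Rate-limit known crawlers on their own zone (map and limit_req_zone belong in the http context)
map $http_user_agent $crawler_key {
    default                 "";                   # empty key: request is not rate-limited
    ~*(Googlebot|bingbot)   $binary_remote_addr;  # crawlers are keyed by client IP
}
limit_req_zone $crawler_key zone=googlerate:10m rate=10r/s;

location / {
    limit_req zone=googlerate burst=20 nodelay;
}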

▍ IP segment misblocking

Google crawler IP library characteristics:

  • IPv4 segment: 66.249.64.0/19, 34.64.0.0/10 (newly added in 2023)
  • IPv6 segment: 2001:4860:4801::/48

Suicidal case:

An e-commerce site used Cloudflare firewall to intercept 66.249.70.* segment IPs (misjudged as crawler attack), causing Googlebot to be unable to crawl for 17 consecutive days, with indexing volume evaporating 62%

Fix: add a rule in the Cloudflare firewall: (ip.src in {66.249.64.0/19 34.64.0.0/10}) → Allow
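
Before writing a firewall rule, blocked IPs from the server logs can be checked against the ranges listed above. This sketch hard-codes the ranges quoted in this article as an assumption; they should be refreshed from Google's published list rather than treated as authoritative.

```python
import ipaddress

# Ranges quoted in the text above; illustrative, not an authoritative list.
LISTED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("66.249.64.0/19", "34.64.0.0/10", "2001:4860:4801::/48")
]

def in_listed_ranges(ip: str) -> bool:
    """True if ip falls inside any of the listed crawler ranges."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed-version comparisons simply return False.
    return any(addr in network for network in LISTED_RANGES)
```

Running the firewall's block log through this check would have flagged the 66.249.70.* addresses in the case above before the block caused damage.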

▍ Blocking rendering critical resources

Block list:

  • Intercepting *.cloudflare.com → causes 67% of CSS/JS to fail loading
  • Blocking Google Fonts → mobile layout collapse rate reaches 89%

Case:

A SaaS platform blocked jquery.com domain, causing Google crawler to report JS errors when rendering pages, with product documentation pages’ HTML parsing rate dropping to only 12%

Unblock solution:

1. Configure whitelist in Nginx:

nginx
location ~* (jquery|bootstrapcdn|cloudflare)\.(com|net) {  
    allow all;  
    add_header X-Static-Resource "Unblocked";  
}  

2. Add crossorigin="anonymous" attribute to async loading resources:

html
<script src="https://example.com/analytics.js" crossorigin="anonymous"></script> 

▍ Server response timeout

Google tolerance threshold:

  • Response time > 2000ms → single crawl session early termination probability increases by 80%
  • Requests processed per second < 50 → crawl budget reduced to 30%

Crash case:

A WordPress site did not enable OPcache, with database query time reaching 4.7 seconds, causing Googlebot crawl timeout rate to soar to 91%, with indexing at a standstill

Performance optimization:

1. PHP-FPM optimized configuration (3x concurrency improvement):

ini
pm = dynamic  
pm.max_children = 50  
pm.start_servers = 12  
pm.min_spare_servers = 8  
pm.max_spare_servers = 30  

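2. MySQL index hint for slow queries:

sql
-- FORCE INDEX is a query hint, so it goes in the slow SELECT rather than in ALTER TABLE
SELECT ID FROM wp_posts FORCE INDEX (type_status_date) WHERE post_type = 'post' AND post_status = 'publish' ORDER BY post_date DESC LIMIT 10;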

With the solutions above, the gap between submitted and indexed pages can be kept stably within 5%. If you need to further improve Google crawl volume, please refer to our “GPC Crawler Pool”.
