This document details the execution and output of Step 2 of the "Site SEO Auditor" workflow: hive_db → diff. This crucial step is responsible for comparing your latest site audit results against previous reports stored in our database, providing you with a clear, actionable overview of changes over time.
hive_db → diff
This step performs a comprehensive comparison between your site's most recent SEO audit report and its preceding audit report, both securely stored in our MongoDB SiteAuditReport collection. The primary goal is to identify and highlight all significant changes, improvements, and regressions across your website's SEO health metrics.
The core objectives of the hive_db → diff step are to:
Upon completion of the site crawling and auditing process (Step 1), a new SiteAuditReport is generated and stored in MongoDB. The hive_db → diff step then executes the following sequence:
* The latest SiteAuditReport (from the current audit run) is fetched from the database.
* The previous SiteAuditReport for your site is retrieved from the database.
* Page-Level Comparison: Each URL present in both reports is compared point-by-point for every SEO checklist item.
* New/Removed Pages: The system identifies pages that are new to the current audit or pages that were present in the previous audit but are no longer found.
* Improvements: Metrics that have moved from a "failing" or "poor" state to a "passing" or "good" state.
* Regressions: Metrics that have moved from a "passing" or "good" state to a "failing" or "poor" state.
* New Issues: Problems identified in the current audit that were not present or detected in the previous one.
* Resolved Issues: Problems from the previous audit that are no longer present in the current one.
* Unchanged: Metrics that remain consistent between audits.
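The comparison categories above can be sketched as follows. This is a minimal illustration, not the auditor's actual implementation: the status vocabulary (`pass`, `good`, `fail`, `poor`), the `{url: {metric: status}}` input shape, and the function names `classify_change` and `diff_pages` are all assumptions made for the example.

```python
from typing import Dict

# Hypothetical status vocabulary; the real report's values may differ.
PASSING = {"pass", "good"}
FAILING = {"fail", "poor"}


def classify_change(before: str, after: str) -> str:
    """Label one metric's transition between two audits."""
    b, a = before.lower(), after.lower()
    if b == a:
        return "unchanged"
    if b in FAILING and a in PASSING:
        return "improvement"
    if b in PASSING and a in FAILING:
        return "regression"
    return "changed"  # e.g. a move to or from an intermediate state


def diff_pages(previous: Dict[str, Dict[str, str]],
               current: Dict[str, Dict[str, str]]) -> dict:
    """Compare two {url: {metric: status}} mappings."""
    prev_urls, curr_urls = set(previous), set(current)
    changes = {}
    # Only URLs present in both reports are compared metric by metric.
    for url in sorted(prev_urls & curr_urls):
        page = {}
        for metric in sorted(set(previous[url]) & set(current[url])):
            label = classify_change(previous[url][metric], current[url][metric])
            if label != "unchanged":
                page[metric] = label
        if page:
            changes[url] = page
    return {
        "new_pages": sorted(curr_urls - prev_urls),
        "removed_pages": sorted(prev_urls - curr_urls),
        "changes": changes,
    }
```

Set operations on the URL keys give the new and removed pages directly, so the metric-level loop only has to handle pages that exist in both audits.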
* The diff is stored in the latest SiteAuditReport document in MongoDB, ensuring that each audit report contains its historical comparison data.
The diffing process compares the following 12-point SEO checklist items for every audited page:
* Largest Contentful Paint (LCP): Performance score (Good, Needs Improvement, Poor).
* Cumulative Layout Shift (CLS): Performance score (Good, Needs Improvement, Poor).
* First Input Delay (FID): Performance score (Good, Needs Improvement, Poor).
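As an illustration of how a raw measurement might be mapped to the Good / Needs Improvement / Poor ratings listed above, the sketch below uses Google's published Core Web Vitals thresholds. Whether the auditor applies exactly these cut-offs is an assumption, and the names `THRESHOLDS` and `rate_vital` are hypothetical.

```python
# (good_max, needs_improvement_max) per metric.
# Values above the second bound are rated "Poor".
THRESHOLDS = {
    "LCP": (2.5, 4.0),   # seconds
    "CLS": (0.1, 0.25),  # unitless score
    "FID": (100, 300),   # milliseconds
}


def rate_vital(metric: str, value: float) -> str:
    """Return the rating bucket for one Core Web Vitals measurement."""
    good_max, ni_max = THRESHOLDS[metric]
    if value <= good_max:
        return "Good"
    if value <= ni_max:
        return "Needs Improvement"
    return "Poor"
```

Comparing ratings rather than raw numbers keeps small measurement noise from being reported as a regression.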
The output of this step is a comprehensive "Before/After Diff Report" integrated directly into your latest SiteAuditReport. This report will be presented in a clear, hierarchical format, allowing for quick identification of critical changes.
* Newly discovered pages.
* Pages no longer found (removed or redirected).
* Pages with improvements.
* Pages with regressions.
For each page that has experienced a change, the report will provide:
* Meta Title: Before: "Old Title" -> After: "New Title" (if changed), or Status: Missing -> Present.
* H1 Tag: Status: Present (OK) -> Missing (ERROR).
* Image Alt Coverage: Before: 80% -> After: 95% (IMPROVEMENT).
* LCP Score: Before: Poor -> After: Good (IMPROVEMENT).
* Broken Elements: List of specific broken elements (e.g., https://example.com/broken-img.jpg) that are New or Resolved.
* Canonical Tag: Before: Missing -> After: Present (IMPROVEMENT).
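The "Before -> After" lines shown above follow a regular pattern, which could be produced by a small formatter like the one below. The function names are hypothetical, and the rule that a higher coverage percentage counts as an improvement is an assumption for the example.

```python
from typing import Optional


def format_diff_line(label: str, before: str, after: str,
                     verdict: Optional[str] = None) -> str:
    """Render one metric change in the report's 'Before -> After' style."""
    line = f"{label}: Before: {before} -> After: {after}"
    if verdict:
        line += f" ({verdict.upper()})"
    return line


def format_coverage_diff(label: str, before_pct: int, after_pct: int) -> str:
    """Render a percentage metric, assuming higher coverage is better."""
    if after_pct > before_pct:
        verdict = "IMPROVEMENT"
    elif after_pct < before_pct:
        verdict = "REGRESSION"
    else:
        verdict = None
    return format_diff_line(label, f"{before_pct}%", f"{after_pct}%", verdict)
```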
### Site Audit Diff Report: Current (YYYY-MM-DD HH:MM) vs. Previous (YYYY-MM-DD HH:MM)
**Overall Site Health Summary:**
* **Overall Score:** +5% (Improved)
* **New Issues:** 12
* **Resolved Issues:** 25
* **Pages with Improvements:** 15
* **Pages with Regressions:** 3
* **New Pages Discovered:** 2
* **Pages No Longer Found:** 1
---
**Page-Level Changes:**
**1. URL: `https://yourdomain.com/product-category/new-widget`**
* **Status:** **New Page Discovered**
* **Key Issues:** Missing Meta Description, LCP: Needs Improvement
**2. URL: `https://yourdomain.com/blog/article-about-seo`**
* **Status:** **Improved**
* **Meta Description:** `Before: Missing -> After: Present (IMPROVEMENT)`
* **Image Alt Coverage:** `Before: 70% -> After: 100% (IMPROVEMENT)`
* **LCP Score:** `Before: Needs Improvement -> After: Good (IMPROVEMENT)`
* **Structured Data:** `Before: Invalid Schema -> After: Valid Schema (IMPROVEMENT)`
**3. URL: `https://yourdomain.com/homepage`**
* **Status:** **Regression**
* **H1 Tag:** `Before: Present (OK) -> After: Missing (ERROR)`
* **CLS Score:** `Before: Good -> After: Needs Improvement (REGRESSION)`
* **Broken Elements:** `New Broken Link: https://yourdomain.com/old-page (ERROR)`
**4. URL: `https://yourdomain.com/contact`**
* **Status:** **No Significant Change** (All metrics within acceptable thresholds, no new or resolved issues)
... (Additional pages listed as necessary)
This document details the execution and output of the initial crawling phase for your "Site SEO Auditor" workflow. This foundational step is critical for accurately assessing your website's SEO performance, as it systematically visits and captures data from every accessible page on your site using a headless browser.
The primary objective of this step is to:
We leverage Puppeteer, a Node.js library developed by Google, to control a headless instance of the Chrome browser.
* JavaScript Execution: Unlike traditional HTTP crawlers that only fetch static HTML, Puppeteer renders pages in a full browser environment. This ensures that all content generated or modified by JavaScript (e.g., dynamic product listings, blog comments, interactive elements) is fully present and visible, accurately reflecting what search engines and users see.
* Real User Simulation: It simulates a real user's browser experience, allowing us to capture how your site behaves and appears under actual browsing conditions.
* Comprehensive Data Capture: Puppeteer enables us to extract not just the raw HTML, but also the fully constructed DOM, network requests, performance metrics, and even screenshots if required for debugging.
The crawl is executed with precision and care to ensure thoroughness while maintaining server politeness.
* The crawl begins from a starting URL (e.g., https://www.yourdomain.com/).
* Each page is loaded by Puppeteer and allowed to fully render, waiting for network idle conditions to ensure all dynamic content has settled.
* The fully rendered DOM is then parsed to extract all internal <a> (anchor) tags linking to unique, unvisited URLs within your domain.
* These newly discovered URLs are added to an intelligent queue for subsequent processing. External links are noted but not followed to keep the audit focused on your domain.
* Sitemap Integration: For enhanced coverage, the crawler also consults your sitemap.xml (if available) to ensure all declared pages are included in the crawl, even if they might not be immediately discoverable via internal linking alone.
* HTTP Errors: Identification and logging of 404 (Not Found), 500 (Server Error), and other HTTP status codes.
* Network Timeouts: Handling of pages that fail to load within a specified timeframe.
* JavaScript Errors: Capture of any client-side JavaScript errors or warnings emitted in the browser console during page rendering.
* All errors are logged with their associated URLs for later review and potential remediation.
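The crawl loop described above (queue of unvisited internal URLs, external links noted but not followed, failures logged by URL) can be sketched as a breadth-first traversal. This is a simplified model, not the production crawler: the `fetch` callable is a hypothetical stand-in for the Puppeteer render step, and `crawl` and `LinkCollector` are names invented for the example. Sitemap integration is left out.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url: str,
          fetch: Callable[[str], Optional[str]],
          max_pages: int = 100) -> Dict[str, List[str]]:
    """Breadth-first crawl limited to start_url's host.

    `fetch` returns rendered HTML, or None when the page fails to load.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited: List[str] = []
    errors: List[str] = []
    external: List[str] = []
    while queue and len(visited) + len(errors) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            errors.append(url)  # logged with its URL for later review
            continue
        visited.append(url)
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop #fragments before de-duplicating.
            absolute = urldefrag(urljoin(url, href)).url
            parts = urlparse(absolute)
            if parts.scheme not in ("http", "https"):
                continue  # skip mailto:, tel:, javascript: and similar
            if parts.netloc != host:
                external.append(absolute)  # noted, not followed
            elif absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return {"visited": visited, "errors": errors, "external": external}
```

The `seen` set is updated when a URL is queued rather than when it is visited, so a page linked from many places is only ever queued once.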
For every successfully visited and rendered unique URL, the following detailed data points are collected:
* Largest Contentful Paint (LCP): The render time of the largest image or text block visible within the viewport.
* Cumulative Layout Shift (CLS): A score quantifying unexpected layout shifts during the page's lifecycle.
Upon completion, this step delivers a structured dataset comprising:
The collected raw data from this Puppeteer-driven crawl is immediately passed to Step 2: Auditor → Analyze. In this subsequent step, the raw HTML, DOM, and performance metrics will be meticulously parsed and analyzed against the 12-point SEO checklist, identifying specific issues and preparing them for automated fix generation.
This detailed diff report provides you with immediate insights into your site's SEO evolution. You can use this information to:
The generated diff report, now stored within your latest SiteAuditReport, will be used in the subsequent steps of the workflow:
* gemini → fix: Identified "broken elements" or critical regressions will be sent to Gemini for AI-driven generation of exact fixes.
* report → notify: A comprehensive report, including this diff analysis and any generated fixes, will be compiled and delivered to you via your preferred notification channels.
* store → archive: The final SiteAuditReport with the embedded diff will be archived for long-term historical tracking.
This document details the successful execution and deliverables for Step 3 of the Site SEO Auditor workflow: AI-Powered Fix Generation using Gemini (batch_generate).
Following the comprehensive crawl and audit of your website in Step 2, our system has identified specific SEO elements that are either missing, incorrect, or sub-optimal according to our 12-point SEO checklist. Step 3 leverages Google's advanced AI model, Gemini, to meticulously analyze each identified issue and generate precise, actionable fixes.
This step transforms raw audit findings into concrete, implementable solutions, significantly streamlining the process of improving your site's SEO health.
* From the extensive SiteAuditReport generated in Step 2, our system extracts every identified "broken element" or "recommendation." This includes specific details such as:
* Page URL: The exact URL where the issue was found.
* SEO Element Type: (e.g., meta_title, meta_description, H1_tag, image_alt, canonical_tag, open_graph_tag, structured_data, viewport).
* Problem Description: A clear explanation of the issue (e.g., "Meta title missing," "H1 tag not found," "Image missing alt attribute," "Canonical tag points to self-referencing non-canonical URL," "CLS score too high").
* Contextual Data: Relevant surrounding HTML, page content snippets, or performance metrics that provide Gemini with the necessary context.
* The collected issues are then batched and fed to the Gemini AI model.
* Intelligent Analysis: Gemini processes each issue by:
* Understanding SEO Best Practices: Applying its vast knowledge of current SEO guidelines, search engine algorithms, and user experience principles.
* Contextual Understanding: Analyzing the specific page content, existing metadata, and identified problems to ensure fixes are relevant and effective for that particular page.
* Problem-Solving Logic: Determining the root cause of the issue and formulating the most appropriate corrective action.
* For each identified issue, Gemini generates an exact, actionable fix. These fixes are designed to be easily understood and implemented by your development or content team.
* Example Fixes Generated:
* Missing Meta Title: Suggests a concise, keyword-rich title (e.g., <title>Product Name - Category | Your Brand</title>).
* Missing Meta Description: Crafts a compelling, informative description encouraging clicks (e.g., <meta name="description" content="Discover our wide range of [product category]. Shop now for [key benefits] and [unique selling points].">).
* Missing H1 Tag: Proposes a clear, descriptive H1 based on page content (e.g., <h1>Main Product Category Page</h1>).
* Image Missing Alt Attribute: Suggests descriptive alt text based on the image's context and surrounding text (e.g., <img src="product.jpg" alt="Blue denim jacket for men">).
* Incorrect Canonical Tag: Recommends the correct canonical URL (e.g., <link rel="canonical" href="https://www.yourdomain.com/product-page/">).
* Missing Open Graph Tags: Provides complete OG tags for social sharing (e.g., <meta property="og:title" content="Page Title">, <meta property="og:image" content="https://...">).
* Core Web Vitals Improvement (LCP/CLS): Suggests specific code-level or configuration changes (e.g., "Preload largest contentful paint image: <link rel="preload" href="path/to/hero-image.jpg" as="image">", "Identify and optimize layout-shifting elements: CSS property aspect-ratio or explicit width/height attributes for images/iframes").
* Missing Mobile Viewport: Inserts the standard viewport meta tag (e.g., <meta name="viewport" content="width=device-width, initial-scale=1">).
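The batching mentioned above (issues are grouped before being sent to the model) can be illustrated with a generic chunking helper. The batch size and the issue dictionary shape are assumptions, and no Gemini API call is shown here because the source does not specify how requests are made.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical issue records, using the fields listed earlier in this section.
issues = [
    {"page_url": "https://www.yourdomain.com/", "seo_element": "meta_title"},
    {"page_url": "https://www.yourdomain.com/a", "seo_element": "H1_tag"},
    {"page_url": "https://www.yourdomain.com/b", "seo_element": "image_alt"},
]
issue_batches = list(batched(issues, 2))
```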
The primary deliverable of this step is a comprehensive collection of Generated Fixes, formatted for clarity and direct implementation.
* page_url: The URL where the fix should be applied.
* seo_element: The specific SEO element being addressed.
* problem_description: A restatement of the original issue.
* suggested_fix: The exact code snippet or actionable instruction generated by Gemini.
* fix_type: (e.g., add_tag, update_attribute, insert_code, configuration_change).
* confidence_score: (Optional) An AI-generated confidence score for the proposed fix.
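One way to model a fix record with the fields listed above is a small dataclass. The field names come from the list; the class name `GeneratedFix`, the Python types, and the sample values are assumptions for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GeneratedFix:
    page_url: str                 # URL where the fix should be applied
    seo_element: str              # e.g. "meta_title", "image_alt"
    problem_description: str      # restatement of the original issue
    suggested_fix: str            # code snippet or instruction
    fix_type: str                 # e.g. "add_tag", "update_attribute"
    confidence_score: Optional[float] = None  # optional, may be absent


fix = GeneratedFix(
    page_url="https://www.yourdomain.com/product-page/",
    seo_element="viewport",
    problem_description="Missing mobile viewport",
    suggested_fix='<meta name="viewport" content="width=device-width, initial-scale=1">',
    fix_type="add_tag",
)
record = asdict(fix)  # plain dict, ready to be stored alongside the audit results
```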
The generated fixes are now stored alongside the audit results in MongoDB. In Step 4, we will finalize the SiteAuditReport document, incorporating these fixes to create a complete before/after diff for each identified issue. This comprehensive report will then be ready for your review and implementation, enabling you to track the impact of these improvements on your site's SEO performance.
This step is critical for securely storing the comprehensive SEO audit findings and their corresponding remediation strategies within your dedicated hive_db. We are performing an upsert operation, which intelligently handles both initial data insertion and subsequent updates to ensure your audit history is accurately maintained.
Following the successful crawling, SEO checklist validation, Core Web Vitals assessment, and Gemini's generation of precise fixes for identified issues, all this valuable data is consolidated into a SiteAuditReport document. This document is then persisted into your MongoDB instance. The upsert operation ensures that:
hive_db Upsert
* hive_db (Your dedicated MongoDB database instance).
* site_audit_reports (A new collection specifically designed to store your SEO audit history).
SiteAuditReport Document Structure
The following structure outlines the SiteAuditReport document that will be stored in the site_audit_reports collection. This model ensures all aspects of your SEO audit are captured and easily retrievable.
{
"_id": ObjectId("..."), // MongoDB's unique document ID
"siteUrl": "https://www.example.com", // The root URL of the site audited
"auditTimestamp": ISODate("2023-10-27T02:00:00.000Z"), // Timestamp of when the audit was initiated
"auditId": "uuid-for-this-audit-run", // Unique identifier for each specific audit run
"auditStatus": "completed", // e.g., "completed", "in_progress", "failed", "cancelled"
"overallScore": 85, // An aggregated SEO score (optional, can be derived)
"pagesAuditedCount": 150, // Total number of unique pages crawled and audited
"auditDetails": [ // Array of detailed results for each audited page
{
"pageUrl": "https://www.example.com/some-page",
"crawlStatus": "success", // e.g., "success", "error", "skipped"
"statusCode": 200, // HTTP status code of the page
"redirectedTo": null, // If redirected, the final URL
"seoChecklistResults": { // Results for the 12-point SEO checklist
"metaTitle": {
"present": true,
"unique": true,
"length": 55,
"value": "Your Page Title - Keyword"
},
"metaDescription": {
"present": true,
"unique": true,
"length": 150,
"value": "Detailed description of your page content."
},
"h1Presence": {
"present": true,
"count": 1,
"value": "Main Heading of the Page"
},
"imageAltCoverage": {
"totalImages": 10,
"imagesWithAlt": 8,
"missingAltImages": [
{"src": "/img/broken1.jpg", "reason": "Missing alt text"},
{"src": "/img/broken2.png", "reason": "Empty alt text"}
],
"coveragePercentage": 80
},
"internalLinkDensity": {
"totalInternalLinks": 25,
"uniqueInternalLinks": 18,
"densityScore": 75 // A calculated score or count
},
"canonicalTag": {
"present": true,
"valid": true,
"value": "https://www.example.com/some-page"
},
"openGraphTags": {
"ogTitle": {"present": true, "value": "OG Title"},
"ogDescription": {"present": true, "value": "OG Description"},
"ogImage": {"present": true, "value": "https://.../og-image.jpg"},
"allPresent": true // True if all essential OG tags are found
},
"structuredData": {
"present": true,
"schemaTypes": ["Article", "BreadcrumbList"],
"validationIssues": [] // Array of issues found by schema validator
},
"mobileViewport": {
"present": true,
"valid": true,
"content": "width=device-width, initial-scale=1.0"
}
// ... other checklist items
},
"coreWebVitals": { // Performance metrics
"LCP": 2.5, // Largest Contentful Paint (seconds)
"CLS": 0.05, // Cumulative Layout Shift
"FID": 50, // First Input Delay (milliseconds)
"performanceScore": 90 // An aggregated Lighthouse/performance score
},
"brokenElements": [ // Issues requiring fixes, identified by the crawler
{
"type": "image",
"selector": "img[src='/img/broken1.jpg']",
"issue": "Missing alt text",
"severity": "medium",
"context": "<img src='/img/broken1.jpg' />"
},
{
"type": "h1",
"selector": "body",
"issue": "No H1 tag found",
"severity": "high",
"context": "<body>...</body>"
}
],
"geminiFixes": [ // Exact fixes generated by Gemini for broken elements
{
"issueType": "image_alt_text",
"originalElement": "<img src='/img/broken1.jpg' />",
"suggestedFix": {
"action": "update_attribute",
"selector": "img[src='/img/broken1.jpg']",
"attribute": "alt",
"value": "Descriptive alt text for image 1"
},
"explanation": "Adding descriptive alt text improves accessibility and SEO for screen readers and image search engines."
},
{
"issueType": "missing_h1",
"originalElement": "<body>...</body>",
"suggestedFix": {
"action": "insert_element",
"selector": "body",
"position": "after_opening_tag",
"element": "<h1>Main Title of the Page</h1>"
},
"explanation": "A unique and descriptive H1 tag is crucial for SEO, indicating the main topic of the page to search engines."
}
]
}
// ... more page audit details
],
"previousAuditId": "uuid-of-previous-audit-run", // Link to the previous audit report for diffing
"diffSummary": { // High-level summary of changes since the last audit
"newIssuesFound": 5,
"issuesResolved": 3,
"scoreChange": "+5", // e.g., "+5", "-2"
"pageCountChange": "+10",
"changedPages": ["https://www.example.com/page-a", "https://www.example.com/page-b"]
},
"createdAt": ISODate("2023-10-27T02:00:00.000Z"), // Timestamp of document creation
"updatedAt": ISODate("2023-10-27T02:05:00.000Z") // Timestamp of last document update
}
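To make the imageAltCoverage fields in the structure above concrete, the sketch below derives totalImages, imagesWithAlt, missingAltImages, and coveragePercentage from a page's HTML. It is an illustration only: the class and function names are hypothetical, rounding to a whole percentage is an assumption, and so is reporting 100 for a page with no images.

```python
from html.parser import HTMLParser
from typing import Dict, List


class ImageAltAuditor(HTMLParser):
    """Counts <img> tags and records those without usable alt text."""

    def __init__(self) -> None:
        super().__init__()
        self.total = 0
        self.missing: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        self.total += 1
        attr_map = dict(attrs)
        src = attr_map.get("src") or ""
        if "alt" not in attr_map:
            self.missing.append({"src": src, "reason": "Missing alt text"})
        elif not (attr_map["alt"] or "").strip():
            self.missing.append({"src": src, "reason": "Empty alt text"})


def image_alt_coverage(html: str) -> dict:
    auditor = ImageAltAuditor()
    auditor.feed(html)
    with_alt = auditor.total - len(auditor.missing)
    # Assumption: a page without images counts as fully covered.
    pct = round(100 * with_alt / auditor.total) if auditor.total else 100
    return {
        "totalImages": auditor.total,
        "imagesWithAlt": with_alt,
        "missingAltImages": auditor.missing,
        "coveragePercentage": pct,
    }
```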
The upsert operation uses a unique key to identify whether a document already exists. For SiteAuditReport documents, the combination of siteUrl and auditId (or auditTimestamp if auditId is not used for primary identification) will serve as the unique identifier.
* Filter: { "siteUrl": "https://www.example.com", "auditId": "uuid-for-this-audit-run" }
* Update: The SiteAuditReport document described above.
* Options: upsert: true
This ensures that each specific audit run for a given site is either created or updated, providing a clear and traceable history of your site's SEO performance.
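The upsert described above can be sketched by building the filter and update documents for a report. The helper name `build_upsert` is hypothetical, and the use of `$set` together with `$setOnInsert` (so that createdAt is written only on first insert) is one reasonable design, not something the source specifies.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


def build_upsert(report: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (filter, update) for an upsert keyed on siteUrl + auditId."""
    query = {"siteUrl": report["siteUrl"], "auditId": report["auditId"]}
    now = datetime.now(timezone.utc)
    # _id is immutable, and createdAt must only be written when inserting.
    fields = {k: v for k, v in report.items() if k not in ("_id", "createdAt")}
    fields["updatedAt"] = now
    update = {"$set": fields, "$setOnInsert": {"createdAt": now}}
    return query, update


# With PyMongo, the pair would be passed to the collection as:
#   collection.update_one(query, update, upsert=True)
```

Keying the filter on both siteUrl and auditId means a retried run updates its own document instead of creating a duplicate.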
* Subsequent audits can use previousAuditId to generate a meaningful "before/after" diff, highlighting changes.
Upon successful completion of this upsert operation, the full SiteAuditReport will be securely stored in your hive_db. This concludes the data processing and storage phase.
The final step (Step 5 of 5) will involve presenting these results in a user-friendly format, potentially triggering notifications, and providing access to the detailed audit report and generated fixes.
hive_db → conditional_update for Site SEO Auditor
This final step in the "Site SEO Auditor" workflow is critical for persistent storage, historical tracking, and delivering actionable insights to you. It updates your MongoDB database (hive_db) with the comprehensive SEO audit results and the generated fixes, including a detailed before/after comparison.
The conditional_update step serves several key purposes:
* Generate a before/after diff against the previous audit report, highlighting changes and the impact of implemented fixes.
SiteAuditReport in MongoDB
All audit results are structured and stored in a new document within the site_audit_reports collection in MongoDB, following a schema designed for comprehensiveness and easy querying.
SiteAuditReport
* _id: Unique identifier for each audit report.
* siteUrl: The base URL of the audited website (e.g., https://example.com).
* auditDate: Timestamp of when the audit was completed.
* triggerType: Indicates if the audit was scheduled (automatic) or manual (on-demand).
* previousAuditId: References the _id of the immediately preceding audit report for the same site, crucial for diff generation.
* overallStatus: A high-level assessment (e.g., Pass, Needs Improvement, Critical Issues).
* overallScore: A calculated score reflecting the site's overall SEO health.
* summary: High-level statistics and a brief overview of critical issues.
* pages: An array of detailed audit results for each page crawled.
* diffSummary: A high-level summary of changes compared to the previousAuditId.
* detailedDiff: A granular, page-by-page and metric-by-metric comparison with the previous audit.
SiteAuditReport Document
{
"_id": "65e8a0b0c1d2e3f4a5b6c7d8",
"siteUrl": "https://www.yourwebsite.com",
"auditDate": ISODate("2024-03-07T02:00:00.000Z"),
"triggerType": "scheduled",
"previousAuditId": "65e1b2c3d4e5f6a7b8c9d0e1", // Reference to the previous week's audit
"overallStatus": "Needs Improvement",
"overallScore": 78,
"summary": {
"totalPagesAudited": 150,
"criticalIssuesDetected": 5,
"warningsDetected": 12,
"pagesWithGeminiFixes": 7
},
"pages": [
{
"pageUrl": "https://www.yourwebsite.com/",
"status": "Needs Improvement",
"metrics": {
"metaTitle": {
"value": "Your Website - Home Page",
"status": "Pass",
"issue": null,
"geminiFix": null
},
"metaDescription": {
"value": "Welcome to Your Website, offering...",
"status": "Fail",
"issue": "Description too short (50 chars). Recommended: 150-160 chars.",
"geminiFix": "Rewrite: 'Discover our wide range of products and services. We are dedicated to providing high-quality solutions tailored to your needs. Learn more about what makes us stand out.'"
},
"h1Presence": {
"value": "Welcome to Our Site",
"status": "Pass",
"issue": null,
"geminiFix": null
},
"imageAltCoverage": {
"status": "Fail",
"issue": "2 images missing alt text.",
"details": [
{ "src": "/img/hero.jpg", "currentAlt": "" },
{ "src": "/img/logo.png", "currentAlt": "" }
],
"geminiFix": "For /img/hero.jpg: Add alt='Hero image depicting [description]'. For /img/logo.png: Add alt='Your Website Logo'."
},
"internalLinkDensity": {
"count": 25,
"status": "Pass",
"issue": null
},
"canonicalTag": {
"value": "https://www.yourwebsite.com/",
"status": "Pass",
"issue": null,
"geminiFix": null
},
"openGraphTags": {
"status": "Fail",
"issue": "og:image and og:description missing.",
"details": {
"ogTitle": "Your Website",
"ogUrl": "https://www.yourwebsite.com/"
},
"geminiFix": "Add <meta property='og:image' content='[URL to image]'> and <meta property='og:description' content='[Concise description for social sharing]'>."
},
"coreWebVitals": {
"lcp": 2.8, // seconds
"cls": 0.05,
"fid": 0.03, // seconds
"status": "Pass",
"issue": null
},
"structuredDataPresence": {
"status": "Pass",
"details": ["Schema.org/WebSite", "Schema.org/Organization"],
"issue": null
},
"mobileViewport": {
"status": "Pass",
"issue": null
}
}
},
// ... more page objects
],
"diffSummary": {
"metricsImproved": ["metaDescription", "imageAltCoverage"],
"metricsDegraded": ["coreWebVitals"],
"newIssues": ["https://www.yourwebsite.com/blog/new-post - Missing H1"],
"fixedIssues": ["https://www.yourwebsite.com/about - Missing Canonical Tag"]
},
"detailedDiff": {
"https://www.yourwebsite.com/": {
"metaDescription": {
"before": "Old short description.",
"after": "New longer description.",
"change": "Improved"
},
"coreWebVitals": {
"lcp": {
"before": 2.1,
"after": 2.8,
"change": "Degraded"
}
}
}
// ... more detailed diffs for other pages and metrics
}
}
The conditional_update process is intelligently designed to handle both initial audits and subsequent recurring checks:
* The system looks up the most recent SiteAuditReport for the siteUrl being audited.
* If no prior audit report exists for the given siteUrl, the current audit results are inserted directly as a brand-new SiteAuditReport document.
* The previousAuditId field will be null.
* diffSummary and detailedDiff fields will also be null or empty, as there's nothing to compare against.
* If a previousAuditReport is found, the system proceeds to generate the before/after diff.
* A new SiteAuditReport document is created for the current audit. This new document will include all current audit results, the calculated diffSummary, and detailedDiff.
* The previousAuditId field in this new document will be populated with the _id of the retrieved previous audit report, establishing a clear historical link.
* This approach ensures that every audit run creates an immutable snapshot, providing a complete and traceable history of your site's SEO evolution.
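The branching logic above can be modeled in memory as follows. This is a sketch under stated assumptions: a plain list stands in for the site_audit_reports collection, auditDate values are ISO-formatted strings (so they sort chronologically), and `compute_diff` is a hypothetical callable supplied by the caller that returns the diffSummary and detailedDiff.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Report = Dict[str, Any]


def latest_report(store: List[Report], site_url: str) -> Optional[Report]:
    """Return the most recent stored report for site_url, if any."""
    candidates = [r for r in store if r["siteUrl"] == site_url]
    return max(candidates, key=lambda r: r["auditDate"]) if candidates else None


def conditional_update(store: List[Report],
                       new_report: Report,
                       compute_diff: Callable[[Report, Report], Tuple[dict, dict]]) -> Report:
    """Insert new_report as a fresh snapshot, linked to its predecessor."""
    previous = latest_report(store, new_report["siteUrl"])
    report = dict(new_report)  # never mutate the caller's dict
    if previous is None:
        # First audit for this site: nothing to compare against.
        report["previousAuditId"] = None
        report["diffSummary"] = None
        report["detailedDiff"] = None
    else:
        summary, detailed = compute_diff(previous, report)
        report["previousAuditId"] = previous["_id"]
        report["diffSummary"] = summary
        report["detailedDiff"] = detailed
    store.append(report)  # earlier snapshots are left untouched
    return report
```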
The core value of the conditional_update step lies in its ability to generate a meaningful comparison between consecutive audits.
diffSummary
This section provides a quick overview of significant changes across the entire site:
* Metrics (e.g., metaDescription, H1Presence, LCP) that have shown a notable improvement or degradation across the site.
detailedDiff
This provides granular, page-specific comparisons for every audited metric. For each page and each metric, it records:
* before value: The state of the metric from the previousAuditReport.
* after value: The state of the metric from the current audit.
* change indicator: A qualitative assessment (e.g., Improved, Degraded, No Change, New Issue, Fixed Issue).
The geminiFix suggestions generated in the previous step are stored directly within the pages array, associated with the specific metric and issue they address. This ensures that when you review an audit report, the recommended fix is immediately available alongside the identified problem.
When generating the diff, if a geminiFix from a previous report is no longer needed (because the issue is resolved), this will be noted in the fixedIssues within the diffSummary. If a new issue arises, a new geminiFix will be generated and stored.
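For numeric metrics, the before / after / change entries of detailedDiff could be computed as in the sketch below. The direction table (whether a lower value is better) and the function names are assumptions for illustration; only the Improved, Degraded, and No Change labels are taken from the description above.

```python
from typing import Dict

# Hypothetical direction table: True means a lower value is better.
LOWER_IS_BETTER = {"lcp": True, "cls": True, "fid": True, "imageAltCoverage": False}


def diff_metric(name: str, before: float, after: float) -> dict:
    """Build one before/after/change entry for a numeric metric."""
    if before == after:
        change = "No Change"
    else:
        got_smaller = after < before
        change = "Improved" if got_smaller == LOWER_IS_BETTER[name] else "Degraded"
    return {"before": before, "after": after, "change": change}


def detailed_diff(previous: Dict[str, Dict[str, float]],
                  current: Dict[str, Dict[str, float]]) -> dict:
    """Compare {page_url: {metric: value}} maps, keeping only changed metrics."""
    result = {}
    for url in previous.keys() & current.keys():
        shared = previous[url].keys() & current[url].keys()
        entries = {
            m: diff_metric(m, previous[url][m], current[url][m])
            for m in shared if m in LOWER_IS_BETTER
        }
        changed = {m: d for m, d in entries.items() if d["change"] != "No Change"}
        if changed:
            result[url] = changed
    return result
```

With the sample values from the document above, an LCP that moves from 2.1 to 2.8 seconds is labeled Degraded because LCP is treated as lower-is-better.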
The system ensures seamless integration with both scheduled and on-demand audit triggers:
* The conditional_update step stores the report, linking it chronologically to the previous week's report.
* Through previousAuditId, a comprehensive chain of audit history is established for your site, enabling deep historical analysis and long-term trend monitoring.
Upon successful completion of this step, the following direct deliverables and benefits are realized:
* A SiteAuditReport document is available in your hive_db (MongoDB) for the audited website, containing all 12 SEO checklist points, Core Web Vitals, and structured data presence.
You will typically interact with this data through a user-friendly interface that visualizes these reports, rather than directly querying MongoDB.
With the data now securely stored and intelligently structured in MongoDB:
SiteAuditReport documents