## SiteAuditReport Upsert

This step is critical for the "Site SEO Auditor" workflow: it ensures that all the SEO audit data collected and analyzed in the previous steps is securely stored and made accessible for historical tracking, reporting, and comparison. The hive_db performs an upsert operation, updating existing audit records or creating new ones as needed.
The primary goal of this step is to persist the comprehensive SiteAuditReport generated by the headless crawler and Gemini AI into your dedicated MongoDB database (hive_db). This underpins historical tracking, reporting, and audit-to-audit comparison.
## SiteAuditReport Data Model

The SiteAuditReport is a comprehensive document designed to capture every detail of the SEO audit. Each report is uniquely identified and linked to your specific site.
* auditId (String, Primary Key): A unique identifier for each audit run (e.g., audit_yourdomain_com_20231027T020000Z).
* siteUrl (String): The root URL of the site that was audited (e.g., https://www.yourdomain.com).
* timestamp (Date): The exact date and time the audit was completed.
* status (String): The overall status of the audit (completed, failed, partial).
* totalPagesAudited (Number): The total number of unique pages successfully crawled and audited.
* overallScore (Number, 0-100): An aggregated score representing the site's overall SEO health based on the checklist.
* siteLevelIssues (Array of Objects): Critical issues affecting the entire site (e.g., robots.txt disallowing critical pages, global noindex tag).
* pages (Array of Objects): A detailed breakdown for each individual page crawled on your site.
  * pageUrl (String): The full URL of the audited page.
  * statusCode (Number): HTTP status code returned by the page (e.g., 200, 301, 404).
  * seoMetrics (Object):
    * metaTitle (Object):
      * value (String)
      * length (Number)
      * isUnique (Boolean)
      * issue (String, if not unique/missing/too long)
    * metaDescription (Object):
      * value (String)
      * length (Number)
      * isUnique (Boolean)
      * issue (String, if not unique/missing/too long)
    * h1Tag (Object):
      * value (String)
      * isPresent (Boolean)
      * isUnique (Boolean)
      * issue (String, if missing/multiple)
    * imageAltCoverage (Object):
      * totalImages (Number)
      * imagesMissingAlt (Array of Strings: image URLs)
      * coveragePercentage (Number)
    * internalLinkDensity (Object):
      * totalInternalLinks (Number)
      * links (Array of Objects: href, anchorText)
    * canonicalTag (Object):
      * value (String, the canonical URL)
      * isPresent (Boolean)
      * isSelfReferencing (Boolean)
      * isCorrect (Boolean)
      * issue (String, if missing/incorrect)
    * openGraphTags (Object):
      * isPresent (Boolean)
      * ogTitle (String)
      * ogDescription (String)
      * ogImage (String)
      * issue (String, if missing critical tags)
    * coreWebVitals (Object):
      * LCP (Number, ms)
      * CLS (Number, score)
      * FID (Number, ms)
      * performanceScore (Number, 0-100)
      * issues (Array of Strings, e.g., 'LCP too high')
    * structuredData (Object):
      * isPresent (Boolean)
      * types (Array of Strings, e.g., 'Schema.org/Article')
      * isValid (Boolean)
      * issue (String, if invalid/missing)
    * mobileViewport (Object):
      * isPresent (Boolean)
      * isConfiguredCorrectly (Boolean)
      * issue (String, if missing/incorrect)
* issuesFound (Array of Objects): Consolidates all detected issues, along with the AI-generated fixes.
  * issueType (String): e.g., 'Missing Meta Title', 'Duplicate H1', 'High LCP'.
  * pageUrl (String): The page where the issue was found.
  * description (String): A detailed explanation of the issue.
  * severity (String): Critical, High, Medium, Low.
  * geminiFixSuggestion (String): The exact, actionable fix generated by Gemini AI.
  * status (String): Open, Fixed, Ignored.
* diffWithPreviousAudit (Object): Stores the changes identified by comparing the current audit's results with the most recent previous audit for the same site, providing an immediate understanding of progress or new regressions.
  * previousAuditId (String): The auditId of the last completed audit.
  * overallScoreChange (Number): Difference in overallScore (positive for improvement, negative for regression).
  * newIssues (Array of Objects): Issues present in the current audit that were not in the previous one.
  * resolvedIssues (Array of Objects): Issues present in the previous audit that are no longer present.
  * changedMetrics (Array of Objects): Specific metrics that have changed significantly (e.g., LCP increased by more than 100 ms on a specific page).

The upsert operation in MongoDB combines an update and an insert into a single atomic operation.
The operation first looks for an existing SiteAuditReport document matching the current audit's siteUrl, and potentially a specific auditId if an in-progress audit is being updated (though typically a new auditId is generated for each run; for recurring audits, the primary lookup for comparison is siteUrl). If a matching document is found (e.g., when refreshing diffWithPreviousAudit), the existing document is updated with the new data. If no match exists, a new SiteAuditReport document is created and inserted into the site_audit_reports collection within hive_db. This ensures that:
* Idempotency: Re-running the upsert with the same auditId (if applicable) would result in the same database state, preventing duplicate records.
* Efficiency: A single upsert replaces separate find and insert or update operations, reducing database round trips.
* Cohesion: The entire report, including diffWithPreviousAudit, is stored as a single, cohesive document.

Upon successful completion of this step, hive_db will return a confirmation of the upsert operation.
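The upsert semantics described above can be illustrated with a small in-memory simulation. This is only a sketch: the real workflow would use the MongoDB Node.js driver, roughly `collection.updateOne({ siteUrl, auditId }, { $set: report }, { upsert: true })`; the function name and the array-backed "collection" here are assumptions for demonstration.

```javascript
// In-memory simulation of MongoDB upsert semantics, matching on
// siteUrl + auditId as described above. Illustrative only.
function upsertAuditReport(collection, report) {
  const idx = collection.findIndex(
    (doc) => doc.siteUrl === report.siteUrl && doc.auditId === report.auditId
  );
  if (idx >= 0) {
    // Match found: update the existing document in place.
    collection[idx] = { ...collection[idx], ...report };
    return { matchedCount: 1, modifiedCount: 1, upsertedCount: 0 };
  }
  // No match: insert a brand-new document.
  collection.push({ ...report });
  return { matchedCount: 0, modifiedCount: 0, upsertedCount: 1 };
}

const siteAuditReports = [];
const report = {
  auditId: "audit_yourdomain_com_20231027T020000Z",
  siteUrl: "https://www.yourdomain.com",
  overallScore: 78,
};

const first = upsertAuditReport(siteAuditReports, report); // first run inserts
const second = upsertAuditReport(siteAuditReports, { ...report, overallScore: 81 }); // second run updates
```

Note how re-running with the same auditId leaves a single document in the collection, which is exactly the idempotency property described above.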
Example Output (Internal Log):

    {
      "status": "success",
      "operation": "upsert",
      "collection": "site_audit_reports",
      "auditId": "audit_yourdomain_com_20231027T020000Z",
      "siteUrl": "https://www.yourdomain.com",
      "matchedCount": 0,
      "modifiedCount": 0,
      "upsertedCount": 1,
      "upsertedId": {
        "$oid": "653b6f1a0b3e4f7a9d0b3e4f"  // MongoDB generated ID for the new document
      },
      "message": "New SiteAuditReport document successfully inserted into hive_db."
    }
## Headless Crawling with Puppeteer

This initial, foundational step of the "Site SEO Auditor" workflow is dedicated to comprehensively crawling your website. Using Puppeteer, a headless browser automation library, we simulate a real user's journey through your site. The primary goal is to discover every accessible page and collect the raw HTML content and critical performance metrics required for a thorough SEO audit.
By mimicking a browser environment, Puppeteer ensures that dynamically loaded content (common in Single Page Applications or JavaScript-heavy sites) is fully rendered and captured, providing an accurate representation of what search engines and users actually see. This step lays the groundwork by furnishing the necessary data for the subsequent 12-point SEO checklist analysis.
Our crawling mechanism is engineered for robustness and accuracy:
* Starting from the designated seed URL(s), Puppeteer navigates to each page.
* Upon successful page load, it extracts all internal <a> tags (HREF attributes) to identify new pages within your domain.
* Discovered URLs are added to a queue for subsequent crawling, ensuring comprehensive site coverage.
* External links are identified but not traversed, focusing the audit on your owned properties.
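The queue-based discovery described above can be sketched as a breadth-first traversal that enqueues only same-origin links. In this sketch, `fetchLinks` is a stub standing in for a Puppeteer page visit (an assumption for illustration); in the real crawler it would be `page.goto(url)` followed by extraction of `<a href>` values.

```javascript
// Breadth-first crawl over internal links only; external links are
// identified (resolved) but never enqueued.
function crawl(seedUrl, fetchLinks) {
  const origin = new URL(seedUrl).origin;
  const visited = new Set();
  const queue = [seedUrl];
  while (queue.length > 0) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);
    for (const href of fetchLinks(url)) {
      const abs = new URL(href, url).href; // resolve relative links
      if (new URL(abs).origin === origin && !visited.has(abs)) {
        queue.push(abs); // internal: enqueue for subsequent crawling
      }
      // external links fall through: noted, not traversed
    }
  }
  return [...visited];
}

// Stubbed site graph for demonstration (hypothetical URLs).
const site = {
  "https://example.com/": ["/about", "/blog", "https://twitter.com/acct"],
  "https://example.com/about": ["/"],
  "https://example.com/blog": ["/blog/post-1"],
  "https://example.com/blog/post-1": ["/"],
};
const pages = crawl("https://example.com/", (url) => site[url] || []);
```

A production crawler would add concurrency limits, robots.txt checks, and error handling on top of this skeleton.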
For each page successfully crawled, the following essential data points are meticulously collected:
* <title> tag content
* <meta name="description"> content
* All <h1> tags
* All <img> tags (for alt attribute checks)
* All internal <a> tags (for link density and broken link checks)
* <link rel="canonical"> tags
* Open Graph (og:) meta tags
* Structured Data (JSON-LD, Microdata, RDFa)
* First Contentful Paint (FCP)
* Largest Contentful Paint (LCP) element identification and timing
* Cumulative Layout Shift (CLS) scores and events
* First Input Delay (FID) related metrics (simulated if direct interaction is not feasible in headless mode, or relying on TBT as a proxy).
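Once collected, these raw elements are distilled into the per-page seoMetrics fields of the SiteAuditReport. The sketch below covers a few of them; the input shape (`title`, `h1s`, `images`) is an assumption about what the crawler hands over, and the 60-character title threshold is illustrative, not a fixed rule.

```javascript
// Distills raw crawl output into a subset of the per-page SEO metrics.
function computeSeoMetrics(raw) {
  const missingAlt = raw.images.filter((img) => !img.alt || img.alt.trim() === "");
  return {
    metaTitle: {
      value: raw.title || null,
      length: raw.title ? raw.title.length : 0,
      issue: !raw.title ? "Missing meta title"
           : raw.title.length > 60 ? "Meta title too long" : null,
    },
    h1Tag: {
      isPresent: raw.h1s.length > 0,
      issue: raw.h1s.length === 0 ? "Missing H1"
           : raw.h1s.length > 1 ? "Multiple H1 tags" : null,
    },
    imageAltCoverage: {
      totalImages: raw.images.length,
      imagesMissingAlt: missingAlt.map((img) => img.src),
      coveragePercentage: raw.images.length === 0 ? 100
        : Math.round(((raw.images.length - missingAlt.length) / raw.images.length) * 100),
    },
  };
}

const metrics = computeSeoMetrics({
  title: "Home",
  h1s: ["Welcome"],
  images: [{ src: "a.jpg", alt: "A widget" }, { src: "b.jpg", alt: "" }],
});
```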
The crawl is executed with the following default parameters, which can be customized based on your site's specific needs:
* Seed URL(s): The crawl starts from your site's root URL (e.g., https://www.yourwebsite.com). Additional seed URLs can be provided for specific sections.
* Exclusions: Paths disallowed by your robots.txt file are respected and excluded from the crawl. Custom exclusion patterns can also be added.

Upon completion of the Puppeteer crawl, the following raw data will be securely stored in MongoDB as part of the SiteAuditReport document for each page:
* The URL
* HTTP Status Code
* The complete raw HTML content
* Extracted performance metrics (LCP, CLS data)
* List of extracted internal links
* Any detected crawl errors or warnings
The data collected in this crawling phase is the essential foundation for the subsequent SEO audit. It feeds directly into the analytical engine that evaluates your site against the 12-point SEO checklist.
This step ensures that our audit is based on the most accurate and complete representation of your website, including all dynamic content, providing you with a reliable baseline for identifying SEO opportunities and issues.
## Audit Comparison (hive_db → diff)

This step is crucial for understanding your website's SEO evolution over time. After the headless crawler audits every page, the newly generated SiteAuditReport is compared against the most recent previous report stored in your dedicated MongoDB instance (hive_db). This comparison generates a detailed "delta" or "diff" report, highlighting specific changes, improvements, and new issues.
The primary goal of this step is to provide a clear, actionable overview of your site's SEO performance trajectory. By comparing current audit results with past data, we can:
This step utilizes two key data sources:
* Current SiteAuditReport: The complete audit results generated by the headless crawler (Step 1), containing detailed SEO data for every page visited.
* Previous SiteAuditReport: The most recent, fully stored audit report retrieved from your hive_db (MongoDB). If no previous report exists (e.g., a first-ever audit), this step establishes the baseline instead.

The system performs a granular, page-by-page and site-wide comparison across all 12 SEO checklist points. The "diff" process identifies three primary types of changes: new issues, resolved issues, and changed metrics.
Below is a detailed breakdown of how the comparison is performed for each SEO checklist item:
**Meta Title & Description Uniqueness**

* Identifies pages with newly introduced duplicate titles or descriptions.
* Flags pages where previously duplicate titles/descriptions are now unique.
* Reports overall site-level changes in the count of unique titles and descriptions.
* Example Diff Output: "Page /product-a now has a duplicate meta title with /product-b. Page /blog/post-old no longer has a duplicate description."

**H1 Tag Presence**

* Detects new pages missing an H1 tag.
* Confirms pages that previously lacked an H1 but now have one.
* Example Diff Output: "New H1 issues detected (e.g., /new-landing). Resolved: 2 pages (e.g., /old-service) now have an H1."

**Image Alt Coverage**

* Calculates the percentage of images missing alt text on a per-page and site-wide basis.
* Highlights pages where alt text coverage has significantly decreased or improved.
* Identifies specific images that are newly missing alt text or have had alt text added.
* Example Diff Output: "Page /gallery now has 5 new images missing alt text. Resolved: 12 images on /about page now have alt attributes."

**Internal Link Density**

* Analyzes the average number of internal links per page and identifies pages with significant fluctuations.
* Detects pages that have newly become "orphaned" (0 internal links pointing to them) or have gained internal links.
* Example Diff Output: "Page /old-product now has 0 internal links (new orphaned page). Average internal links per page increased by 2. Resolved: /guide/topic-x gained 3 internal links."

**Canonical Tags**

* Identifies pages with newly missing or incorrectly implemented canonical tags.
* Confirms resolution for pages that previously had canonical tag issues.
* Example Diff Output: "New: /category?sort=price is missing a canonical tag. Resolved: Canonical tag on /old-promo is now correctly implemented."

**Open Graph Tags**

* Detects pages newly missing essential Open Graph (OG) tags (og:title, og:description, og:image, og:url, og:type).
* Verifies resolution of previously identified OG tag issues.
* Example Diff Output: "New: /event/webinar is missing og:image. Resolved: All essential OG tags are now present on /news/article-123."

**Core Web Vitals**

* Performs a page-by-page comparison of LCP (Largest Contentful Paint), CLS (Cumulative Layout Shift), and FID (First Input Delay) scores.
* Highlights pages where any metric has significantly worsened (e.g., moved from "Good" to "Needs Improvement" or "Poor") or improved.
* Reports overall site average changes for each metric.
* Example Diff Output: "Page /homepage LCP worsened from 'Good' to 'Needs Improvement' (2.1s -> 3.5s). Page /blog/post-xyz CLS improved from 'Needs Improvement' to 'Good' (0.15 -> 0.08). Site-wide LCP average increased by 0.3s."

**Structured Data**

* Identifies pages newly missing structured data (e.g., Schema.org markup) or containing validation errors.
* Confirms resolution of previously reported structured data issues.
* Example Diff Output: "New: /recipe/pizza is missing Recipe structured data. Resolved: Validation errors on /product/item-abc have been fixed."

**Mobile Viewport**

* Detects pages where the mobile viewport meta tag is newly missing or incorrectly configured.
* Verifies resolution of previously identified viewport issues.
* Example Diff Output: "New: /legacy-section is missing the viewport meta tag. Resolved: Viewport configuration on /checkout is now correct."

The generated "diff" report is not a separate document but is integrated directly into the newly created SiteAuditReport. This report, containing both the current audit findings and the historical delta, is then stored in your hive_db (MongoDB).
The diff section within the SiteAuditReport will be structured to clearly indicate:
* overall_site_changes: High-level summaries of changes across the entire site.
* page_level_changes: Specific URLs where issues were resolved, newly introduced, or performance metrics shifted.
* resolved_issues_count: Total number of issues fixed.
* new_issues_count: Total number of new issues detected.

This approach ensures that every audit provides immediate context against past performance, making it easier to prioritize fixes and monitor the impact of your SEO efforts.
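The new/resolved bookkeeping described above can be computed by comparing the previous and current issuesFound arrays. In this sketch, keying each issue on issueType plus pageUrl is a simplifying assumption.

```javascript
// Computes the diff's new/resolved issue sets and their counts.
function diffIssues(previousIssues, currentIssues) {
  const key = (i) => `${i.issueType}@@${i.pageUrl}`;
  const prevKeys = new Set(previousIssues.map(key));
  const currKeys = new Set(currentIssues.map(key));
  const newIssues = currentIssues.filter((i) => !prevKeys.has(key(i)));
  const resolvedIssues = previousIssues.filter((i) => !currKeys.has(key(i)));
  return {
    newIssues,
    resolvedIssues,
    new_issues_count: newIssues.length,
    resolved_issues_count: resolvedIssues.length,
  };
}

const diff = diffIssues(
  [
    { issueType: "Missing H1", pageUrl: "/old-service" },
    { issueType: "Duplicate Meta Title", pageUrl: "/product-a" },
  ],
  [
    { issueType: "Duplicate Meta Title", pageUrl: "/product-a" },
    { issueType: "Missing Canonical", pageUrl: "/category" },
  ]
);
```

Here the duplicate meta title persists, the missing H1 counts as resolved, and the missing canonical counts as new.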
## AI Fix Generation (gemini → batch_generate)

This step leverages Google's Gemini AI to automatically generate precise, actionable fixes for all SEO discrepancies identified by the headless crawler in the previous stage. It moves beyond merely identifying problems to providing concrete solutions, significantly streamlining your site's SEO optimization process.
Following the comprehensive audit conducted by our headless crawler (Puppeteer), which meticulously scanned every page on your site against a 12-point SEO checklist, a list of "broken elements" or SEO issues has been compiled. In this step, these identified issues are systematically fed into the Gemini AI model. Gemini's role is to intelligently analyze each specific problem and generate the exact, code-level or instruction-based fix required to resolve it.
The primary objective of the gemini → batch_generate step is to transform each identified issue into a precise, ready-to-apply fix, generated in a single batched pass rather than one request per issue.
For each identified SEO issue, the Gemini model receives a structured set of data points, ensuring it has all the necessary context to generate an accurate fix. This input typically includes:
* Issue Details: The issue type and a description of the problem.
* Page Context: The affected page URL.
* Offending Snippet: The exact HTML involved (e.g., title="My Duplicate Title", <img src="image.jpg">).

### Fix Generation (batch_generate)

Gemini's generated fixes can take several forms:

* Direct Code Snippets: For issues like missing attributes, incorrect tags, or structured data additions.
* Recommended Values: For meta titles, descriptions, alt text, or canonical URLs.
* Step-by-Step Instructions: For more complex issues requiring manual intervention or changes to CMS settings.
* Before/After Diffs: Presenting the problematic code and the suggested corrected code side-by-side.
The output from this step is a comprehensive list of suggested fixes, each linked to its original issue. For every problematic element, you will receive:
* Suggested Code Change: An exact HTML/CSS/JS snippet to replace or add.
* Recommended Text/Value: For meta tags, H1 content, alt descriptions, etc.
* Clear Instructions: Guidance on how to implement the fix (e.g., "Add alt="Descriptive text" to this image tag," "Update the meta title to 'New Unique Title'," "Ensure this canonical tag points to https://example.com/canonical-page").
* Meta Title Uniqueness:
  * Problem: <title>My Generic Page</title> on multiple pages.
  * Fix: <title>My Generic Page - Specific Product Name</title> (with instruction to make it unique).
* Missing H1:
  * Problem: No <h1> tag found.
  * Fix: <h1>Main Heading of This Page Content</h1> (with instruction to place it appropriately).
* Image Alt Coverage:
  * Problem: <img src="product.jpg">
  * Fix: <img src="product.jpg" alt="Red Widget with Silver Trim">
* Broken Canonical Tag:
  * Problem: <link rel="canonical" href="http://example.com/broken-link">
  * Fix: <link rel="canonical" href="https://example.com/correct-canonical-page">
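A batched request to Gemini might be assembled from the per-issue inputs like this. The prompt wording, and the idea of bundling all issues into one request, are assumptions; the actual Gemini API invocation (e.g., via an SDK) is deliberately omitted.

```javascript
// Assembles one batched prompt from a list of detected issues.
function buildBatchPrompt(issues) {
  const header =
    "For each SEO issue below, return the exact fix as a code snippet or a concrete instruction.";
  const body = issues
    .map(
      (issue, n) =>
        `Issue ${n + 1}: ${issue.issueType} on ${issue.pageUrl}\n` +
        `Offending snippet: ${issue.snippet}`
    )
    .join("\n\n");
  return `${header}\n\n${body}`;
}

const prompt = buildBatchPrompt([
  { issueType: "Missing H1", pageUrl: "/new-landing", snippet: "(no <h1> found)" },
  { issueType: "Image Alt Coverage", pageUrl: "/gallery", snippet: '<img src="product.jpg">' },
]);
```

Numbering each issue lets the response be parsed back and matched to its originating issue, which is how the fixes are later attached to the report.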
The generated fixes are a critical component of the SiteAuditReport that will be stored in MongoDB. Each fix will be associated with its corresponding detected issue, forming the "after" state in the before/after diff.
In subsequent steps, this detailed report, complete with AI-generated fixes, will be presented to you. This empowers you to review, prioritize, and implement the necessary changes to significantly improve your site's SEO performance with clear, actionable guidance.
What this means:
* "status": "success" indicates the operation completed without errors.
* "upsertedCount": 1 indicates a new document containing your SiteAuditReport has been created in the site_audit_reports collection.
* "upsertedId" is the unique MongoDB internal ID for this newly created report.

You can verify the presence and content of this report by querying hive_db for the auditId or siteUrl.
This step transforms raw audit data into actionable, persistent insights:

* The diffWithPreviousAudit section immediately highlights what has improved and what new issues have arisen, allowing for focused remediation efforts.

With the SiteAuditReport securely stored in hive_db, the final step of the workflow will focus on leveraging this data:

* A report or dashboard built from diffWithPreviousAudit and Gemini's suggested fixes will provide you with a clear, actionable summary of your site's SEO health.

## Finalization (hive_db → conditional_update)

This step, hive_db → conditional_update, represents the successful finalization of your Site SEO Audit. All comprehensive audit data, including page-specific findings and Gemini-generated fixes, has been stored and updated within your dedicated MongoDB database.
This process ensures that a persistent record of your site's SEO performance is maintained, enabling historical tracking, performance comparisons, and detailed reporting.
The headless crawler has completed its scan, the 12-point SEO checklist has been applied to every page, and any identified issues have been processed by Gemini to generate precise fixes. This final step confirms that all this valuable information has been:
* Consolidated into a single SiteAuditReport document, optimized for retrieval and analysis.

## SiteAuditReport Structure

Each audit run generates a new document within the SiteAuditReports collection. This document is designed to provide a comprehensive snapshot of your site's SEO health at the time of the audit.
Key Fields of a SiteAuditReport Document:
* auditId (UUID): A unique identifier for this specific audit run.
* siteUrl (String): The primary URL of the website that was audited (e.g., https://www.yourdomain.com).
* auditDate (ISODate): Timestamp indicating when this audit was completed.
* auditType (String): Specifies whether the audit was Scheduled (e.g., weekly) or On-Demand.
* overallSummary (Object):
  * totalPagesAudited: Number of unique pages processed.
  * totalIssuesFound: Aggregate count of all SEO issues across the site.
  * passingChecksPercentage: Overall percentage of checks that passed.
  * coreWebVitalsSummary: Average/median LCP, CLS, FID across audited pages.
* pagesAudited (Array of Objects): Each object represents the detailed audit results for a single page.
  * pageUrl (String): The URL of the specific page.
  * seoChecks (Object): Detailed status for each of the 12 SEO checklist items for this page.
    * metaTitle: { status: 'PASS'/'FAIL', details: '...' }
    * metaDescription: { status: 'PASS'/'FAIL', details: '...' }
    * h1Presence: { status: 'PASS'/'FAIL', details: '...' }
    * imageAltCoverage: { status: 'PASS'/'FAIL', details: '...' }
    * internalLinkDensity: { status: 'PASS'/'FAIL', details: '...' }
    * canonicalTag: { status: 'PASS'/'FAIL', details: '...' }
    * openGraphTags: { status: 'PASS'/'FAIL', details: '...' }
    * coreWebVitals: { lcp: '...', cls: '...', fid: '...' }
    * structuredData: { status: 'PASS'/'FAIL', details: '...' }
    * mobileViewport: { status: 'PASS'/'FAIL', details: '...' }
    * titleUniqueness: { status: 'PASS'/'FAIL', details: '...' }
    * descriptionUniqueness: { status: 'PASS'/'FAIL', details: '...' }
  * issuesFound (Array of Objects): A list of specific problems identified on this page.
    * issueType: (e.g., "Missing H1", "Duplicate Meta Title")
    * severity: (e.g., "Critical", "Warning", "Info")
    * details: Contextual information about the issue.
  * geminiFixes (Array of Objects): If issues were found, this array contains the AI-generated remediation steps.
    * issueType: Matches an issueType from issuesFound.
    * fixDescription: Human-readable explanation of the fix.
    * codeSnippet (Optional): Exact code or configuration to apply.
    * instructions: Step-by-step guidance for implementation.
* previousAuditId (UUID, Optional): A reference to the auditId of the immediately preceding audit for this site. This is crucial for generating the diff.
* diffReport (Object, Optional): Populated when a previousAuditId exists, detailing the "before/after" comparison.

This critical information, now securely stored, forms the backbone of your SEO performance monitoring.
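As an example of how the summary fields derive from the per-page data, overallSummary.passingChecksPercentage can be computed from the seoChecks objects described above. Skipping entries without a status field (such as coreWebVitals, which carries raw metrics rather than PASS/FAIL) is an assumption of this sketch.

```javascript
// Derives the overall passing-checks percentage from per-page seoChecks.
function passingChecksPercentage(pagesAudited) {
  let pass = 0;
  let total = 0;
  for (const page of pagesAudited) {
    for (const check of Object.values(page.seoChecks)) {
      if (!check.status) continue; // skip metric-only entries like coreWebVitals
      total += 1;
      if (check.status === "PASS") pass += 1;
    }
  }
  return total === 0 ? 100 : Math.round((pass / total) * 100);
}

const pct = passingChecksPercentage([
  { seoChecks: { metaTitle: { status: "PASS" }, h1Presence: { status: "FAIL" } } },
  { seoChecks: { metaTitle: { status: "PASS" }, coreWebVitals: { lcp: "2.1s" } } },
]);
```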
## Automated Diff Generation (diffReport)

A key feature of this workflow is the automatic generation of a diffReport. Upon completion of a new audit, the system compares the current audit's results against the most recent previous audit for your site (referenced by previousAuditId).
The diffReport includes:
* Number of new issues introduced.
* Number of previously existing issues resolved.
* Significant changes in overall Core Web Vitals performance.
* Pages that have improved (e.g., resolved issues, better CWV scores).
* Pages that have regressed (e.g., new issues, worsened CWV scores).
* Specific checks that have changed status (e.g., a page's "Meta Title Uniqueness" changed from FAIL to PASS).
This diff report provides immediate insights into the impact of recent changes on your site's SEO, highlighting areas of improvement and newly introduced regressions.
The SiteAuditReport documents, including the detailed page-level data and the "before/after" diff, are now available for retrieval. This data powers your SEO audit dashboard and reporting interface.
Your Site SEO Auditor is now fully operational.
Step 5, hive_db → conditional_update, marks the successful completion of the Site SEO Audit workflow. All audit results, including detailed page data, SEO issue identification, Core Web Vitals, and Gemini-generated remediation steps, have been durably stored in your MongoDB database. The "before/after" diff report is now available, providing actionable insights into your site's SEO evolution. Your site is now under continuous, intelligent SEO surveillance.