This step is crucial for persisting the comprehensive SEO audit results, including all identified issues and Gemini-generated fixes, into our secure MongoDB database. This ensures that your site's SEO performance history is meticulously recorded, enabling trend analysis and the "before/after" differential reporting.
The primary purpose of this hive_db → upsert operation is to store the complete SiteAuditReport document generated in the previous steps. This document encapsulates all findings from the headless crawl, the 12-point SEO checklist evaluation, Core Web Vitals measurements, structured data analysis, mobile viewport checks, and the AI-powered fix suggestions from Gemini.
By performing an upsert, we ensure:
Database: PantheraHive (or a dedicated SeoAuditor database within PantheraHive).
Collection: SiteAuditReports
Each document in the SiteAuditReports collection represents a single, complete SEO audit performed for a specific website at a given timestamp.
SiteAuditReport Schema
The following details the structure of the SiteAuditReport document that will be upserted into MongoDB. This schema is designed for comprehensive reporting, historical comparison, and actionable insights.
{
"_id": ObjectId, // Unique identifier for this audit report (e.g., generated by MongoDB)
"siteUrl": String, // The full URL of the site that was audited (e.g., "https://www.example.com")
"auditTimestamp": ISODate, // Date and time when the audit was completed
"auditType": String, // "scheduled" (every Sunday 2 AM) or "on-demand"
"status": String, // "completed", "failed", "processing"
"pagesAuditedCount": Number, // Total number of unique pages successfully crawled and audited
"pagesWithIssuesCount": Number, // Number of pages where at least one SEO issue was detected
"overallSeoScore": { // An aggregate score reflecting overall SEO health (e.g., 0-100)
"value": Number,
"grade": String // e.g., "Excellent", "Good", "Fair", "Poor"
},
"summaryIssues": { // Aggregated summary of issues across the entire site
"metaTitle": { "failCount": Number, "pagesAffected": [String, ...] },
"metaDescription": { "failCount": Number, "pagesAffected": [String, ...] },
"h1Presence": { "failCount": Number, "pagesAffected": [String, ...] },
"imageAltCoverage": { "failCount": Number, "pagesAffected": [String, ...] },
"internalLinkDensity": { "failCount": Number, "pagesAffected": [String, ...] },
"canonicalTags": { "failCount": Number, "pagesAffected": [String, ...] },
"openGraphTags": { "failCount": Number, "pagesAffected": [String, ...] },
"coreWebVitals": {
"lcpFailCount": Number, "clsFailCount": Number, "fidFailCount": Number,
"pagesAffected": [String, ...]
},
"structuredDataPresence": { "failCount": Number, "pagesAffected": [String, ...] },
"mobileViewport": { "failCount": Number, "pagesAffected": [String, ...] },
// ... other checks as needed
},
"previousAuditReportId": ObjectId, // Reference to the _id of the immediately preceding audit report for this site (null if first audit)
"diffReport": { // Detailed comparison with the previous audit
"hasChanged": Boolean, // True if any significant change (new issue, resolved issue, metric change)
"newIssues": [ // Issues found in THIS audit that were NOT present in the previous one
{
"pageUrl": String,
"issueType": String, // e.g., "Missing H1", "Duplicate Meta Title", "Poor LCP"
"description": String,
"geminiFix": String // The exact fix generated by Gemini
}
],
"resolvedIssues": [ // Issues from the PREVIOUS audit that are NO LONGER present in this one
{
"pageUrl": String,
"issueType": String,
"description": String
}
],
"changedMetrics": [ // Significant changes in key metrics (e.g., overall score, Core Web Vitals)
{
"metricName": String, // e.g., "overallSeoScore", "LCP_average"
"oldValue": Any,
"newValue": Any,
"change": String // e.g., "+5", "-10ms"
}
]
},
"pageDetails": [ // An array containing detailed audit results for each page crawled
{
"pageUrl": String, // The URL of the specific page
"statusCode": Number, // HTTP status code of the page (e.g., 200, 404, 301)
"hasIssues": Boolean, // True if this page has any identified SEO issues
"seoChecks": {
"metaTitle": {
"status": String, // "PASS", "FAIL", "N/A"
"value": String, // The actual meta title found
"issues": [String], // e.g., ["Too long", "Not unique"]
"geminiFix": String // Exact fix generated by Gemini for this specific issue
},
"metaDescription": {
"status": String,
"value": String,
"issues": [String],
"geminiFix": String
},
"h1Presence": {
"status": String,
"value": String, // The actual H1 text found (or "N/A")
"issues": [String], // e.g., ["Missing H1", "Multiple H1s"]
"geminiFix": String
},
"imageAltCoverage": {
"status": String,
"totalImages": Number,
"imagesMissingAlt": Number,
"issues": [String], // e.g., ["3 images missing alt text"]
"geminiFix": String
},
"internalLinkDensity": {
"status": String,
"totalInternalLinks": Number,
"issues": [String], // e.g., ["Low internal link count"]
"geminiFix": String
},
"canonicalTags": {
"status": String,
"value": String, // The canonical URL found (or "N/A")
"issues": [String], // e.g., ["Missing canonical", "Self-referencing canonical incorrect"]
"geminiFix": String
},
"openGraphTags": {
"status": String,
"issues": [String], // e.g., ["Missing og:title", "Missing og:image"]
"geminiFix": String
},
"structuredDataPresence": {
"status": String,
"detectedTypes": [String], // e.g., ["Schema.org/Article", "Schema.org/Product"]
"issues": [String], // e.g., ["Missing required fields for Article schema"]
"geminiFix": String
},
"mobileViewport": {
"status": String,
"issues": [String], // e.g., ["Viewport meta tag missing"]
"geminiFix": String
},
// ... (other 12-point checks as defined)
},
"coreWebVitals": {
"lcp": { // Largest Contentful Paint
"value": Number, // in ms
"status": String, // "PASS", "FAIL"
"issues": [String] // e.g., ["LCP too slow (2.8s)"]
},
"cls": { // Cumulative Layout Shift
"value": Number,
"status": String,
"issues": [String]
},
"fid": { // First Input Delay (or INP if available)
"value": Number, // in ms
"status": String,
"issues": [String]
}
},
"allIssuesFoundOnPage": [ // Consolidated list of all issues for this page
{
"type": String, // e.g., "Meta Title", "H1", "LCP"
"description": String, // Specific issue description
"severity": String, // "Critical", "High", "Medium", "Low"
"geminiFix": String // The exact fix generated by Gemini
}
]
}
]
}
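As an illustration of how the site-wide summaryIssues block relates to the per-page pageDetails entries, the following minimal sketch counts FAIL statuses per check and collects the affected URLs. The function name `summarize_issues` and the sample pages are hypothetical; only the field names (pageUrl, seoChecks, status, failCount, pagesAffected) come from the schema above.

```python
from collections import defaultdict


def summarize_issues(page_details):
    """Aggregate per-page seoChecks results into a summaryIssues-style dict."""
    summary = defaultdict(lambda: {"failCount": 0, "pagesAffected": []})
    for page in page_details:
        for check_name, result in page.get("seoChecks", {}).items():
            # Only a FAIL status counts as an issue in this sketch.
            if result.get("status") == "FAIL":
                summary[check_name]["failCount"] += 1
                summary[check_name]["pagesAffected"].append(page["pageUrl"])
    return dict(summary)


# Hypothetical sample data, shaped like two pageDetails entries.
pages = [
    {
        "pageUrl": "https://www.example.com/",
        "seoChecks": {
            "metaTitle": {"status": "PASS"},
            "h1Presence": {"status": "FAIL"},
        },
    },
    {
        "pageUrl": "https://www.example.com/about",
        "seoChecks": {
            "metaTitle": {"status": "FAIL"},
            "h1Presence": {"status": "FAIL"},
        },
    },
]
summary = summarize_issues(pages)
```

Checks with no failing pages are simply absent from the result here; a real implementation might instead emit them with a failCount of 0.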
This document details the successful execution of the initial crawling phase for your "Site SEO Auditor" workflow. This crucial first step leverages a headless browser to comprehensively discover and capture the state of every page on your website, laying the foundation for a thorough SEO audit.
Objective: The primary goal of this step is to systematically visit and gather raw data from every discoverable page on your website. Unlike traditional HTTP crawlers, our approach simulates a real user's browser experience, ensuring that dynamic content, JavaScript-rendered elements, and single-page applications (SPAs) are fully processed and available for subsequent analysis.
Outcome: A comprehensive inventory of all unique URLs on your site, along with their fully rendered HTML content and associated network requests. This data forms the input for the detailed 12-point SEO checklist audit.
We utilize Puppeteer, a Node.js library, to control a headless Chromium browser. This technology choice is critical for several reasons:
Crawl Strategy:
If sitemap.xml is discoverable, it is prioritized to ensure rapid discovery of known URLs.
The crawl honors robots.txt directives, to avoid overwhelming your server and ensure a responsible crawl.
During this phase, the headless browser performs the following actions for each discovered URL:
The successful completion of the crawl step yields the following raw data, which is then prepared for the subsequent auditing phase:
The data gathered in this crawling step is now securely stored and serves as the direct input for Step 2: SEO Audit Execution. The subsequent step will iterate through each discovered URL and its associated rendered content to apply the detailed 12-point SEO checklist, identifying specific areas for improvement.
This concludes Step 1: Site Crawl Initiation. We are now proceeding to Step 2: SEO Audit Execution, where the captured data will be analyzed against the 12-point SEO checklist.
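The crawl strategy described for this step (sitemap URLs first, robots.txt respected) could be sketched roughly as follows. This is not the workflow's actual Puppeteer code: it only shows the queue-building logic in Python, using the standard library's `urllib.robotparser`. The function name `build_crawl_queue`, the user agent string, and the sample URLs are assumptions made for illustration.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def build_crawl_queue(sitemap_urls, discovered_links, robots_txt,
                      user_agent="SiteSeoAuditor"):
    """Return a de-duplicated queue with sitemap URLs ahead of discovered links."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    queue, seen = deque(), set()
    # Sitemap URLs are listed first, so they are queued (and crawled) first.
    for url in list(sitemap_urls) + list(discovered_links):
        if url in seen:
            continue
        if not robots.can_fetch(user_agent, url):
            continue  # Disallowed by robots.txt, so never queued.
        seen.add(url)
        queue.append(url)
    return queue


# Hypothetical inputs.
queue = build_crawl_queue(
    sitemap_urls=["https://www.example.com/", "https://www.example.com/pricing"],
    discovered_links=[
        "https://www.example.com/pricing",
        "https://www.example.com/admin/login",
        "https://www.example.com/blog",
    ],
    robots_txt="User-agent: *\nDisallow: /admin/\n",
)
```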
This crucial step in the Site SEO Auditor workflow is responsible for comparing the newly generated, comprehensive SEO audit report with the previously stored audit data within your dedicated MongoDB instance (hive_db). The primary objective is to identify and highlight changes, improvements, and regressions across your website's SEO health, providing a clear "before and after" snapshot.
The "diff" generation serves several vital purposes for your SEO strategy:
To generate an accurate and meaningful diff, this step utilizes two primary data sources:
The previous SiteAuditReport document retrieved from your hive_db (MongoDB). This serves as the reference point for comparison. If no previous report exists (e.g., for the very first audit), the "before" state will be considered empty, and the "diff" will effectively be the initial report itself.
The diff generation process involves a meticulous, page-by-page and site-wide comparison across all 12 SEO checklist points.
For each URL audited in the current run, the system performs a direct comparison with its corresponding data from the previous audit. Key aspects compared at the page level include:
* New Pages: URLs found in the current audit but not in the previous one.
* Removed Pages: URLs found in the previous audit but not in the current one (e.g., due to redirects, deletions, or crawler blockages).
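The new/removed page detection above amounts to a set difference over the URLs of the two audits. A minimal sketch, with a hypothetical function name and sample URLs:

```python
def diff_page_sets(previous_urls, current_urls):
    """Split URLs into new, removed, and common pages between two audits."""
    previous, current = set(previous_urls), set(current_urls)
    return {
        "new_pages": sorted(current - previous),
        "removed_pages": sorted(previous - current),
        "common_pages": sorted(current & previous),
    }


page_diff = diff_page_sets(
    previous_urls=["https://www.example.com/", "https://www.example.com/old"],
    current_urls=["https://www.example.com/", "https://www.example.com/new"],
)
```

Only the pages in `common_pages` have both a "before" and an "after" state, so they are the ones eligible for metric-level comparison.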
Beyond individual page comparisons, the system aggregates changes to provide a high-level overview of site-wide trends:
A core function of this step is to flag "broken elements" or significant deviations. This involves:
* Improvements: Metrics that moved towards an optimal state.
* Regressions: Metrics that moved away from an optimal state, indicating a new or worsened issue.
* No Change: Metrics that remained consistent.
/img/banner.jpg on /page-y is missing alt text," "LCP on /product-page-z increased from 2.0s to 3.5s").
The generated diff is a structured JSON object that becomes an integral part of the new SiteAuditReport document. It is designed for clarity and actionability, presenting data at both a summary and granular level.
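The improvement / regression / no-change classification described above depends on which direction is "better" for a given metric. The sketch below assumes a hypothetical set of lower-is-better metric names and an optional tolerance; neither is specified by the workflow.

```python
# Assumption: for these metrics a smaller value is better.
LOWER_IS_BETTER = {"lcp", "cls", "fid", "imagesMissingAlt"}


def classify_change(metric_name, before, after, tolerance=0.0):
    """Label a numeric metric change as improvement, regression, or no_change."""
    delta = after - before
    if abs(delta) <= tolerance:
        return "no_change"
    if metric_name in LOWER_IS_BETTER:
        improved = delta < 0
    else:
        improved = delta > 0
    return "improvement" if improved else "regression"


lcp_change = classify_change("lcp", 2000, 3500)          # LCP in ms
score_change = classify_change("overallSeoScore", 80, 85)
cls_change = classify_change("cls", 0.05, 0.05)
```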
A high-level overview of the site's SEO performance changes:
overall_score_diff:
  * before: [Previous overall score]
  * after: [Current overall score]
  * change: [Difference: +X (improvement), -X (regression)]
page_count_diff:
  * total_pages_before: [Number of pages in previous audit]
  * total_pages_after: [Number of pages in current audit]
  * new_pages_detected: [List of new URLs]
  * pages_no_longer_found: [List of URLs from previous audit not found now]
summary_of_changes_by_category:
  * on_page_seo: { improvements: X, regressions: Y, no_change: Z }
  * technical_seo: { improvements: X, regressions: Y, no_change: Z }
  * performance_core_web_vitals: { improvements: X, regressions: Y, no_change: Z }
A detailed breakdown of changes for individual URLs where significant differences were detected:
page_changes: [Array of objects]
  * url: https://yourdomain.com/example-page
  * status: improved | regressed | no_significant_change | new_page | page_removed
  * metric_diffs: [Array of objects for specific metric changes on this URL]
    * metric_name: meta_title_uniqueness
    * before: duplicate
    * after: unique
    * change_type: improvement
    * details: Meta title is now unique across the site.
    * broken_element: false (or true if it's a regression)
    * element_locator: document.head > title (if applicable)
    * current_value: Your Unique Title
    * previous_value: Another Duplicate Title
A specific comparison for each of the 12 SEO checklist items, showing site-wide trends:
meta_title_description_diff:
  * unique_titles_before: X, unique_titles_after: Y, change: Z
  * duplicate_titles_before: X, duplicate_titles_after: Y, change: Z (and list of URLs affected)
  * missing_titles_before: X, missing_titles_after: Y, change: Z (and list of URLs affected)
  (Similar structure for descriptions)
h1_presence_diff:
  * pages_with_h1_before: X, pages_with_h1_after: Y, change: Z
  * pages_missing_h1_before: X, pages_missing_h1_after: Y, change: Z (and list of URLs affected, marked as broken_element: true if newly missing)
  * pages_multiple_h1_before: X, pages_multiple_h1_after: Y, change: Z (and list of URLs affected)
image_alt_coverage_diff:
  * images_total_before: X, images_total_after: Y
  * images_with_alt_before: X, images_with_alt_after: Y, change: Z
  * images_missing_alt_before: X, images_missing_alt_after: Y, change: Z (and list of specific image URLs + page URLs, marked as broken_element: true if newly missing)
  * alt_coverage_percentage_before: X%, alt_coverage_percentage_after: Y%
internal_link_density_diff:
  * avg_internal_links_per_page_before: X, avg_internal_links_per_page_after: Y, change: Z
  * pages_with_low_links_before: X, pages_with_low_links_after: Y, change: Z (and list of URLs)
canonical_tags_diff:
  * pages_with_canonical_before: X, pages_with_canonical_after: Y, change: Z
  * pages_missing_canonical_before: X, pages_missing_canonical_after: Y, change: Z (and list of URLs, marked as broken_element: true if newly missing)
  * pages_incorrect_canonical_before: X, pages_incorrect_canonical_after: Y, change: Z (and list of URLs with details)
open_graph_tags_diff:
  * pages_with_og_tags_before: X, pages_with_og_tags_after: Y, change: Z
  * pages_missing_og_tags_before: X, pages_missing_og_tags_after: Y, change: Z (and list of URLs, marked as broken_element: true if newly missing)
  * pages_incomplete_og_tags_before: X, pages_incomplete_og_tags_after: Y, change: Z (and list of URLs with details)
core_web_vitals_diff:
  * lcp_diff: { avg_before: X, avg_after: Y, change: Z, pages_regressed: [URLs], pages_improved: [URLs] }
  * cls_diff: { avg_before: X, avg_after: Y, change: Z, pages_regressed: [URLs], pages_improved: [URLs] }
  * fid_diff: { avg_before: X, avg_after: Y, change: Z, pages_regressed: [URLs], pages_improved: [URLs] }
  (Specific pages where CWV crossed "good" or "needs improvement" thresholds will be highlighted as broken elements if regressed).
structured_data_diff:
  * pages_with_sd_before: X, pages_with_sd_after: Y, change: Z
  * pages_missing_sd_before: X, pages_missing_sd_after: Y, change: Z (and list of URLs, marked as broken_element: true if newly missing)
  * pages_sd_errors_before: X, pages_sd_errors_after: Y, change: Z (and list of URLs with error details)
mobile_viewport_diff:
  * pages_with_viewport_before: X, pages_with_viewport_after: Y, change: Z
  * pages_missing_viewport_before: X, pages_missing_viewport_after: Y, change: Z (and list of URLs, marked as broken_element: true if newly missing)
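The Core Web Vitals note above mentions pages crossing the "good" or "needs improvement" thresholds. A minimal sketch of that check follows, using Google's published boundaries (LCP 2.5 s / 4.0 s, CLS 0.1 / 0.25, FID 100 ms / 300 ms). The function names and the rule "flag as broken_element whenever the rating gets worse" are assumptions, not the workflow's confirmed logic.

```python
# (good_upper_bound, needs_improvement_upper_bound); LCP and FID in milliseconds.
THRESHOLDS = {
    "lcp": (2500, 4000),
    "cls": (0.1, 0.25),
    "fid": (100, 300),
}
RATING_ORDER = ["good", "needs_improvement", "poor"]


def rate(metric, value):
    """Map a raw metric value to good / needs_improvement / poor."""
    good, needs_improvement = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= needs_improvement:
        return "needs_improvement"
    return "poor"


def crossed_threshold(metric, before, after):
    """Compare ratings across two audits and flag a worsened rating."""
    rating_before, rating_after = rate(metric, before), rate(metric, after)
    worsened = RATING_ORDER.index(rating_after) > RATING_ORDER.index(rating_before)
    return {"before": rating_before, "after": rating_after, "broken_element": worsened}


lcp_result = crossed_threshold("lcp", 2000, 3500)
```

A change that stays within one rating band (for example LCP going from 1.8 s to 2.3 s) would still show up in the averages but would not be flagged by this rule.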
The generated diff is not just a comparison; it's a direct input for actionable steps. All identified "broken elements" or significant regressions (e.g., a page losing its H1, an image losing its alt text, a Core Web Vital score dipping below recommended thresholds) are specifically flagged within this diff. This precise identification allows for targeted remediation.
The complete diff data, structured as described above, is then stored as a nested object within the newly created SiteAuditReport document in your hive_db (MongoDB). This ensures that each audit report contains its own historical comparison, making it easy to retrieve and analyze past changes.
The diff generated in this step is immediately passed to the next stage of the workflow. Specifically, the identified "broken elements" and regressions will be sent to Gemini, which will leverage its AI capabilities to generate exact, actionable fixes for these issues. This ensures a seamless transition from detection to resolution within your SEO workflow.
This step focuses on leveraging Google's Gemini AI to automatically generate precise, actionable fixes for all identified SEO issues across your website. Following the comprehensive audit by our headless crawler, any "broken elements" or non-compliant SEO attributes are systematically fed into Gemini for intelligent remediation.
Our system aggregates all detected SEO deficiencies from the initial crawling phase. This includes, but is not limited to, missing meta descriptions, duplicate titles, images without alt text, missing H1s, or incorrect canonical tags. These issues are then batched and sent to Gemini, which analyzes each specific problem in context and provides a tailored solution.
* The specific URL where the issue was found.
* The type of SEO issue (e.g., "Missing Meta Description", "Duplicate H1", "Image without Alt Text", "Incorrect Canonical Tag").
* The relevant HTML snippet or contextual information (e.g., the <img> tag for missing alt text, the <head> section for meta tags).
* Any associated audit metrics (e.g., the duplicate title it's clashing with, the specific image URL).
* Issue: Image on /about-us has no alt attribute: <img src="/img/team.jpg">
* Gemini Fix: Add descriptive alt text: <img src="/img/team.jpg" alt="Our dedicated team working together">
* Issue: Meta Description missing on /products/item-123
* Gemini Fix: Add unique meta description: <meta name="description" content="Discover the features and benefits of product X, designed for optimal performance and user satisfaction.">
* Issue: Duplicate H1 tag on /blog/post-title
* Gemini Fix: Revise secondary H1 to H2: Change <h1>Related Posts</h1> to <h2>Related Posts</h2>
* url: The page URL where the fix applies.
* issue_type: The original SEO issue identified.
* original_element_html: The problematic HTML snippet.
* recommended_fix_html: The exact HTML or code snippet to replace/add.
* fix_description: A human-readable explanation of the fix and why it's recommended.
* severity: The SEO severity level of the issue (e.g., Critical, High, Medium, Low).
* fix_status: (Initially) "Generated", indicating it's ready for review/implementation.
The generated fixes are stored with the corresponding SiteAuditReport. This ensures a complete record of issues and their proposed solutions.
This step transforms raw audit data into actionable intelligence, providing you with a clear roadmap to optimize your website's SEO performance effectively and efficiently.
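To make the input and output shapes of this step concrete, here is a small sketch of a fix record carrying the fields listed above, together with a helper that assembles the issue context into prompt text. It does not call Gemini or any real API; `FixRecord`, `build_prompt`, and the prompt wording are hypothetical, and the sample values reuse the alt-text example from this section.

```python
from dataclasses import dataclass, asdict


@dataclass
class FixRecord:
    """One generated fix, mirroring the output fields listed above."""
    url: str
    issue_type: str
    original_element_html: str
    recommended_fix_html: str
    fix_description: str
    severity: str
    fix_status: str = "Generated"  # Initial status, ready for review.


def build_prompt(url, issue_type, html_snippet, context=""):
    """Assemble the per-issue context into a single prompt string (hypothetical wording)."""
    lines = [
        "Propose an exact HTML fix for the following SEO issue.",
        f"URL: {url}",
        f"Issue type: {issue_type}",
        f"HTML snippet: {html_snippet}",
    ]
    if context:
        lines.append(f"Context: {context}")
    return "\n".join(lines)


prompt = build_prompt("/about-us", "Image without Alt Text", '<img src="/img/team.jpg">')
record = FixRecord(
    url="/about-us",
    issue_type="Image without Alt Text",
    original_element_html='<img src="/img/team.jpg">',
    recommended_fix_html='<img src="/img/team.jpg" alt="Our dedicated team working together">',
    fix_description="Add descriptive alt text to the image.",
    severity="Medium",
)
```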
The siteUrl field acts as the primary identifier for tracking a specific website's audit history.
The system queries the SiteAuditReports collection to find the most recent SiteAuditReport document for the given siteUrl where status is "completed".
The current audit is compared with that report to produce newIssues, resolvedIssues, and changedMetrics.
previousAuditReportId: The _id of the retrieved previous report is assigned to the previousAuditReportId field in the current SiteAuditReport document.
The new SiteAuditReport document, incorporating the calculated diffReport and previousAuditReportId, is then inserted into the SiteAuditReports collection. Since each audit is a snapshot in time, we are always inserting a new document rather than updating an existing one, ensuring a complete historical record.
siteUrl, auditTimestamp, and _id are indexed to ensure efficient querying and retrieval of audit reports, especially for historical lookup and differential reporting.
All connections to hive_db are authenticated and authorized, ensuring that only the "Site SEO Auditor" workflow has the necessary permissions to perform this upsert operation.
Upon successful completion of this step, a new SiteAuditReport document will be available in the SiteAuditReports collection. This report is immediately accessible:
diffReport (e.g., new critical issues, a drop in overall SEO score).
With the audit report successfully stored in hive_db, the workflow proceeds to the final step:
hive_db → conditional_update - Site SEO Auditor Database Update
This deliverable confirms the successful execution of the final step in your "Site SEO Auditor" workflow. This crucial step involves persisting all gathered audit data, SEO recommendations, and performance metrics into our secure, scalable MongoDB database.
Status: COMPLETED
Step 5 of 5: hive_db → conditional_update for the "Site SEO Auditor" workflow has been successfully executed. All audit findings, including the 12-point SEO checklist results, Core Web Vitals, Gemini-generated fixes, and the before/after differential report, have been securely stored in your dedicated MongoDB instance.
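The storage logic described earlier (look up the most recent completed report for the same siteUrl, link it through previousAuditReportId, then insert the new report as a separate document) can be sketched with an in-memory list standing in for the SiteAuditReports collection. The function names and the string `_id` values are hypothetical, and no database driver is used here.

```python
from datetime import datetime, timezone


def find_previous_report(reports, site_url):
    """Return the latest completed report for site_url, or None if there is none."""
    candidates = [
        r for r in reports
        if r["siteUrl"] == site_url and r["status"] == "completed"
    ]
    return max(candidates, key=lambda r: r["auditTimestamp"], default=None)


def store_report(reports, new_report):
    """Link the new report to its predecessor and append it (insert, never overwrite)."""
    previous = find_previous_report(reports, new_report["siteUrl"])
    new_report["previousAuditReportId"] = previous["_id"] if previous else None
    reports.append(new_report)
    return new_report


def ts(day):
    return datetime(2023, 10, day, 2, 0, tzinfo=timezone.utc)


collection = [
    {"_id": "a1", "siteUrl": "https://www.example.com", "status": "completed", "auditTimestamp": ts(13)},
    {"_id": "a2", "siteUrl": "https://www.example.com", "status": "completed", "auditTimestamp": ts(20)},
    {"_id": "a3", "siteUrl": "https://www.example.com", "status": "failed", "auditTimestamp": ts(22)},
]
stored = store_report(
    collection,
    {"_id": "a4", "siteUrl": "https://www.example.com", "status": "completed", "auditTimestamp": ts(27)},
)
```

With PyMongo, the lookup would roughly correspond to `find_one` with a filter on `siteUrl` and `status` plus a descending `sort` on `auditTimestamp`, followed by `insert_one`; that mapping is an assumption rather than something the workflow documentation confirms.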
The hive_db → conditional_update step is fundamental to the value proposition of the Site SEO Auditor. Its primary purposes are:
Each audit run generates a SiteAuditReport document in MongoDB, structured to capture every detail of your site's SEO health. This includes:
auditId: Unique identifier for each audit report.
siteUrl: The root URL of the audited website.
auditTimestamp: Date and time when the audit was completed.
auditType: scheduled (e.g., weekly) or on-demand.
status: completed, in_progress, failed.
totalPagesAudited: Total number of unique pages crawled and audited.
overallSeoHealthScore: An aggregated score reflecting the site's general SEO health.
totalIssuesIdentified: Count of all unique SEO issues found across the site.
pageAudits: For each page crawled by Puppeteer, the following detailed information is stored:
url: The specific URL of the audited page.
httpStatus: HTTP status code returned (e.g., 200, 301, 404).
loadTimeMs: Page load time in milliseconds.
puppeteerMetrics: Raw metrics from Puppeteer (e.g., DOMContentLoaded, FirstContentfulPaint).
seoChecks: Detailed results for each of the 12 SEO checklist points:
* metaTitle:
* value: The actual meta title.
* isPresent: Boolean.
* isUnique: Boolean (site-wide).
* lengthStatus: PASS/WARNING/FAIL based on character count.
* metaDescription:
* value: The actual meta description.
* isPresent: Boolean.
* isUnique: Boolean (site-wide).
* lengthStatus: PASS/WARNING/FAIL.
* h1:
* value: The H1 content.
* isPresent: Boolean.
* count: Number of H1s found (should be 1).
* status: PASS/FAIL (e.g., missing or multiple H1s).
* imageAltCoverage:
* percentage: Percentage of images with alt text.
* missingAlts: Array of image src URLs missing alt text.
* status: PASS/WARNING/FAIL.
* internalLinkDensity:
* count: Total number of internal links.
* links: Array of internal link href values.
* status: PASS/WARNING (e.g., very low density).
* canonicalTag:
* isPresent: Boolean.
* value: The canonical URL, if present.
* isValid: Boolean (e.g., self-referencing, valid URL).
* openGraphTags:
* isPresent: Boolean.
* properties: Object containing key OG properties (e.g., og:title, og:image).
* status: PASS/WARNING (e.g., missing essential properties).
* coreWebVitals:
* lcp (Largest Contentful Paint): score, status (PASS/WARNING/FAIL).
* cls (Cumulative Layout Shift): score, status.
* fid (First Input Delay): score, status.
* structuredData:
* isPresent: Boolean.
* types: Array of detected schema types (e.g., Article, Product).
* isValid: Boolean (based on basic validation).
* mobileViewport:
* isPresent: Boolean (meta viewport tag).
* configuration: String (e.g., width=device-width, initial-scale=1.0).
* status: PASS/FAIL.
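A few of the per-page checks listed above (H1 count, image alt coverage, mobile viewport) can be illustrated with Python's standard `html.parser`. This is only a sketch run against a static HTML string, not the workflow's Puppeteer-based implementation; the class and function names are hypothetical, and an image is treated as missing alt text only when the `alt` attribute is absent altogether.

```python
from html.parser import HTMLParser


class SeoCheckParser(HTMLParser):
    """Collect the raw facts needed for the h1, imageAltCoverage, and mobileViewport checks."""

    def __init__(self):
        super().__init__()
        self.h1_count = 0
        self.images_total = 0
        self.missing_alts = []
        self.viewport = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "h1":
            self.h1_count += 1
        elif tag == "img":
            self.images_total += 1
            if "alt" not in attributes:  # Assumption: an empty alt="" is accepted.
                self.missing_alts.append(attributes.get("src"))
        elif tag == "meta" and (attributes.get("name") or "").lower() == "viewport":
            self.viewport = attributes.get("content")


def run_checks(html):
    parser = SeoCheckParser()
    parser.feed(html)
    with_alt = parser.images_total - len(parser.missing_alts)
    percentage = 100.0 if parser.images_total == 0 else 100.0 * with_alt / parser.images_total
    return {
        "h1": {"count": parser.h1_count, "status": "PASS" if parser.h1_count == 1 else "FAIL"},
        "imageAltCoverage": {"percentage": percentage, "missingAlts": parser.missing_alts},
        "mobileViewport": {
            "isPresent": parser.viewport is not None,
            "configuration": parser.viewport,
            "status": "PASS" if parser.viewport is not None else "FAIL",
        },
    }


sample_html = (
    '<html><head><meta name="viewport" content="width=device-width, initial-scale=1.0"></head>'
    '<body><h1>Welcome</h1><img src="/images/logo.png">'
    '<img src="/images/team.jpg" alt="Team"></body></html>'
)
checks = run_checks(sample_html)
```

The PASS/WARNING/FAIL cut-offs for alt coverage are not stated in this document, so the sketch reports only the percentage and leaves that status out.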
issuesIdentified: For each detected SEO issue on a specific page:
check: The SEO check that failed (e.g., metaTitle, h1, imageAltCoverage).
description: A human-readable description of the issue.
severity: CRITICAL, MAJOR, MINOR.
geminiFix:
* originalIssue: Detailed problem statement provided to Gemini.
* suggestedFix: Exact, actionable code or content fix generated by Gemini.
* fixConfidence: Gemini's confidence score for the fix.
diffReport: This critical section provides a concise overview of changes since the last audit:
previousAuditId: The _id of the prior audit report for comparison (null if this is the first audit).
changes: An array detailing specific metric changes:
* pageUrl: The URL where the change occurred.
* metric: The specific SEO metric that changed (e.g., pagesAudited[0].seoChecks.metaTitle.isUnique).
* oldValue: Value from the previous audit.
* newValue: Value from the current audit.
* changeType: improved, regressed, new_issue, issue_resolved.
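One way the entries of the changes array could be derived is by comparing the status of each check between two audits of the same page. The mapping from status transitions to changeType below is an assumption made for illustration (any move to FAIL is a new_issue, any move away from FAIL is an issue_resolved, other transitions are improved or regressed), as are the function name and the metric path format.

```python
def diff_check_statuses(page_url, old_checks, new_checks):
    """Build changes-style entries for checks whose status differs between audits."""
    changes = []
    # Only checks present in both audits can be compared.
    for check in sorted(set(old_checks) & set(new_checks)):
        old_status = old_checks[check].get("status")
        new_status = new_checks[check].get("status")
        if old_status == new_status:
            continue
        if new_status == "FAIL":
            change_type = "new_issue"
        elif old_status == "FAIL":
            change_type = "issue_resolved"
        else:
            change_type = "improved" if new_status == "PASS" else "regressed"
        changes.append({
            "pageUrl": page_url,
            "metric": f"seoChecks.{check}.status",
            "oldValue": old_status,
            "newValue": new_status,
            "changeType": change_type,
        })
    return changes


old = {
    "metaTitle": {"status": "FAIL"},
    "h1": {"status": "PASS"},
    "imageAltCoverage": {"status": "PASS"},
}
new = {
    "metaTitle": {"status": "PASS"},
    "h1": {"status": "PASS"},
    "imageAltCoverage": {"status": "WARNING"},
}
changes = diff_check_statuses("https://www.example.com/", old, new)
```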
SiteAuditReport Document Structure (Simplified)
{
"_id": ObjectId("653b6d2e6a7b8c9d0e1f2a3b"),
"auditId": "seo-audit-20231027-020000",
"siteUrl": "https://www.example.com",
"auditTimestamp": ISODate("2023-10-27T02:00:00Z"),
"auditType": "scheduled",
"status": "completed",
"totalPagesAudited": 150,
"overallSeoHealthScore": 85,
"totalIssuesIdentified": 12,
"pagesAudited": [
{
"url": "https://www.example.com/",
"httpStatus": 200,
"loadTimeMs": 1250,
"seoChecks": {
"metaTitle": { "value": "Example Home Page", "isUnique": true, "status": "PASS", "lengthStatus": "PASS" },
"h1": { "value": "Welcome to Example!", "isPresent": true, "count": 1, "status": "PASS" },
"imageAltCoverage": { "percentage": 95, "missingAlts": ["/images/logo.png"], "status": "WARNING" },
"coreWebVitals": {
"lcp": { "score": 2.1, "status": "PASS" },
"cls": { "score": 0.05, "status": "PASS" },
"fid": { "score": 50, "status": "PASS" }
}
// ... other checks
},
"issuesIdentified": [
{
"check": "imageAltCoverage",
"description": "Image '/images/logo.png' is missing alt text.",
"severity": "MINOR",
"geminiFix": {
"originalIssue": "The <img> tag for '/images/logo.png' lacks an 'alt' attribute.",
"suggestedFix": "Add `alt=\"Example Company Logo\"` to the `<img>` tag for `/images/logo.png`.",
"fixConfidence": 0.95
}
}
]
},
{
"url": "https://www.example.com/blog/latest-post",
"httpStatus": 200,
"loadTimeMs": 2800,
"seoChecks": {
"metaTitle": { "value": "Latest Blog Post", "isUnique": false, "status": "FAIL", "lengthStatus": "PASS" },
"h1": { "value": "Latest Blog Post Title", "isPresent": true, "count": 1, "status": "PASS" },
"coreWebVitals": {
"lcp": { "score": 3.5, "status": "FAIL