gemini → batch_generate
This step uses the Gemini model to turn the SEO issues identified during the crawling phase into precise, actionable solutions. Rather than only reporting problems, the system automatically generates the exact fixes you need to implement, which streamlines your SEO improvement process.
The primary objective of the gemini → batch_generate step is to automate the creation of specific, code-level or instructional fixes for all detected SEO vulnerabilities. This proactive approach ensures that your team receives not just an audit, but a comprehensive action plan, eliminating the need for manual research and solution formulation for common SEO issues.
For each identified "broken element" or SEO issue, the Gemini model receives a detailed context, including:
src without an alt attribute, the element contributing to a poor LCP).
Upon receiving the detailed input, the Gemini model performs the following:
The output from this step is a collection of highly detailed, actionable fixes for each identified SEO issue. These fixes are designed to be readily implementable by your development or content team. Each fix typically includes:
Example: "The page is missing an H1 tag, which is crucial for content hierarchy and SEO."
Example (Missing H1):
Example (Missing Image Alt Text):
```html
<!-- Original: <img src="/images/product.jpg"> -->
<!-- Proposed: -->
<img src="/images/product.jpg" alt="Detailed description of the product image for accessibility and SEO">
```
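The step name suggests that issues are sent to the model in groups rather than one at a time. The following is a minimal sketch of how detected issues might be split into batches and rendered as prompts. The batch size, the prompt wording, and the helper names `chunk_issues` and `build_prompt` are assumptions for illustration; the actual request format used with Gemini is not described in this document.

```python
from typing import Iterator


def chunk_issues(issues: list[dict], batch_size: int = 20) -> Iterator[list[dict]]:
    """Yield consecutive groups of at most batch_size issues."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(issues), batch_size):
        yield issues[start:start + batch_size]


def build_prompt(batch: list[dict]) -> str:
    """Render one batch as a plain-text prompt (hypothetical format)."""
    lines = ["Suggest a fix for each SEO issue below."]
    for number, issue in enumerate(batch, start=1):
        selector = issue.get("elementSelector", "n/a")
        lines.append(
            f"{number}. [{issue['issueType']}] {issue['details']} (element: {selector})"
        )
    return "\n".join(lines)


# Illustrative issues; the field names are assumed, not taken from a real audit.
issues = [
    {"issueType": "Missing H1", "details": "No H1 tag on /pricing"},
    {"issueType": "Missing Alt Text", "details": "Image lacks alt attribute",
     "elementSelector": "img.hero"},
    {"issueType": "Duplicate Meta Title", "details": "Title shared with /another-page"},
]
prompts = [build_prompt(batch) for batch in chunk_issues(issues, batch_size=2)]
```

Each entry in `prompts` would then be sent to the model in a separate request, which keeps individual requests small and lets a failed batch be retried on its own.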
Workflow: Site SEO Auditor
Step Description: This initial step uses Puppeteer, a Node.js library, to drive a headless browser and systematically crawl every accessible page on your website. Its primary goal is to discover all unique URLs, capture the full HTML content of each page, and collect foundational performance metrics essential for the subsequent SEO audit.
This foundational step is critical for building a comprehensive understanding of your website's structure and content. By mimicking a real user's browser, Puppeteer ensures that JavaScript-rendered content, single-page applications (SPAs), and dynamic elements are fully loaded and accessible for auditing, which traditional HTTP-based crawlers might miss.
* For each page visited, Puppeteer waits for the page to fully load, including the execution of JavaScript and the rendering of dynamic content.
* It meticulously records all network requests made by the page (e.g., images, scripts, stylesheets, AJAX calls), which is crucial for identifying broken resources and understanding page dependencies.
* Upon loading a page, Puppeteer extracts all internal links (<a> tags pointing to other pages within your domain).
* These newly discovered URLs are added to a queue for subsequent crawling, ensuring a thorough exploration of your site's architecture.
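The link discovery and queueing described above amounts to a breadth-first traversal of the site. Below is a minimal sketch of that loop in Python, using only the standard library. It is not the Puppeteer implementation: page loading is abstracted behind a hypothetical `fetch_html` callable (in the real workflow, the headless browser would supply the rendered HTML), and `max_pages` is an assumed safety limit.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(page_url: str, html: str) -> list[str]:
    """Return absolute, fragment-free links that stay on the page's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links: list[str] = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def crawl(start_url: str, fetch_html: Callable[[str], str],
          max_pages: int = 500) -> dict[str, str]:
    """Breadth-first crawl; returns a mapping of URL to captured HTML."""
    queue = deque([start_url])
    seen = {start_url}
    pages: dict[str, str] = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch_html(url)
        pages[url] = html
        for link in internal_links(url, html):
            if link not in seen:  # each URL is queued at most once
                seen.add(link)
                queue.append(link)
    return pages


# Tiny in-memory "site" standing in for real page loads.
site = {
    "https://example.com/": '<a href="/about">About</a> '
                            '<a href="https://other.org/">External</a> '
                            '<a href="/about#team">Team</a>',
    "https://example.com/about": '<a href="/">Home</a> <a href="contact">Contact</a>',
    "https://example.com/contact": "<p>No links here</p>",
}
pages = crawl("https://example.com/", lambda url: site[url])
```

The `seen` set is what prevents revisiting pages that link to each other, and stripping fragments keeps `/about` and `/about#team` from being treated as separate URLs.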
The output of this crawling step provides the raw data necessary for the subsequent SEO audit. For each unique URL found on your site, the following information is collected:
Deliverable for this step:
A comprehensive, structured dataset containing:
Next Steps:
The data collected in this "puppeteer → crawl" step is immediately fed into Step 2: SEO Element Extraction & Core Web Vitals Measurement. In this subsequent step, the raw HTML will be parsed to extract specific SEO elements (meta tags, H1s, alt text, etc.), and detailed Core Web Vitals metrics (LCP, CLS, FID) will be measured using Lighthouse within the Puppeteer environment.
This initial crawl ensures that no corner of your website is left unexamined, providing a solid foundation for a precise and actionable SEO audit.
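To illustrate the kind of parsing the next step performs on the captured HTML, here is a minimal sketch that extracts a few of the SEO elements mentioned above. It uses Python's standard `html.parser` and is an illustration only: the class and function names are hypothetical, the output keys are borrowed from the report fields for readability, an empty `alt` attribute is treated as missing (an assumption, since empty alt text can be valid for decorative images), and the viewport check only tests for presence.

```python
from html.parser import HTMLParser


class SeoElementParser(HTMLParser):
    """Collect title, meta description, H1 text, image alt counts, viewport."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.meta_description = None
        self.h1_texts: list[str] = []
        self.images_total = 0
        self.images_with_alt = 0
        self.has_viewport = False
        self._capture = None  # name of the tag whose text is being captured

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._capture = "title"
        elif tag == "h1":
            self._capture = "h1"
            self.h1_texts.append("")
        elif tag == "meta":
            name = (attributes.get("name") or "").lower()
            if name == "description":
                self.meta_description = attributes.get("content")
            elif name == "viewport":
                self.has_viewport = True
        elif tag == "img":
            self.images_total += 1
            if (attributes.get("alt") or "").strip():
                self.images_with_alt += 1

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

    def handle_data(self, data):
        if self._capture == "title":
            self.title += data
        elif self._capture == "h1":
            self.h1_texts[-1] += data


def extract_seo_elements(html: str) -> dict:
    parser = SeoElementParser()
    parser.feed(html)
    if parser.images_total:
        coverage = 100.0 * parser.images_with_alt / parser.images_total
    else:
        coverage = 100.0  # no images, so nothing is missing alt text
    return {
        "metaTitle": parser.title.strip(),
        "metaDescription": parser.meta_description,
        "h1Present": bool(parser.h1_texts),
        "h1Content": parser.h1_texts[0].strip() if parser.h1_texts else None,
        "imageAltCoverage": round(coverage, 1),
        "mobileViewportPresent": parser.has_viewport,
    }


sample_html = """<html><head><title>Example Page</title>
<meta name="description" content="A short description.">
<meta name="viewport" content="width=device-width, initial-scale=1">
</head><body><h1>Welcome</h1>
<img src="/a.jpg" alt="Product photo"><img src="/b.jpg"></body></html>"""
elements = extract_seo_elements(sample_html)
```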
hive_db → diff
This document details the completion of Step 2 in your "Site SEO Auditor" workflow: generating a comprehensive "diff" report by comparing the latest SEO audit results with the previously stored audit data in your dedicated MongoDB instance (hive_db). This critical step provides a clear, actionable overview of changes and progress over time.
The primary objective of this step is to provide a granular comparison between the most recent SEO audit and the last recorded audit for your website. This "before and after" analysis is essential for:
Upon completion of the headless crawl and initial audit against the 12-point checklist, the system performs the following actions:
The previous SiteAuditReport for your domain is fetched from hive_db (MongoDB).
* Quantitative Metrics (e.g., Core Web Vitals, Internal Link Density): Numerical values are compared, and percentage or absolute changes are calculated.
* Qualitative Metrics (e.g., H1 Presence, Canonical Tags, Structured Data Presence, Mobile Viewport, Open Graph Tags): Boolean states (pass/fail) or specific content attributes are compared to identify changes.
* Unique/Duplicate Checks (e.g., Meta Title/Description): The system identifies newly unique pages, pages that have become duplicates, or pages where content has changed.
* Image Alt Coverage: Changes in the number of images lacking alt text or improvements in coverage are tracked.
* New Broken Elements: Issues identified in the current audit that were not present in the previous one.
* Resolved Broken Elements: Issues present in the previous audit that are no longer detected in the current one.
* Persisting Broken Elements: Issues that remain unfixed across both audits.
diff object.
The generated diff is a core component of the SiteAuditReport stored in hive_db. It provides a detailed, page-by-page and site-wide comparison.
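The classification of broken elements into new, resolved, and persisting can be expressed as a comparison of keyed issue sets. The sketch below is one way to do it, assuming each issue is a dictionary that can be identified by its page URL, issue type, and an optional element selector; that identity key and the function names are assumptions, not the system's documented behavior.

```python
def issue_key(issue: dict) -> tuple:
    """Identity of an issue across audits (assumed: URL + type + selector)."""
    return (issue["url"], issue["issueType"], issue.get("elementSelector"))


def diff_broken_elements(previous: list[dict], current: list[dict]) -> dict:
    """Split issues into new, resolved, and persisting between two audits."""
    prev = {issue_key(issue): issue for issue in previous}
    curr = {issue_key(issue): issue for issue in current}
    return {
        "new": [curr[key] for key in curr if key not in prev],
        "resolved": [prev[key] for key in prev if key not in curr],
        "persisting": [curr[key] for key in curr if key in prev],
    }


previous_issues = [
    {"url": "/pricing", "issueType": "Missing H1"},
    {"url": "/about", "issueType": "Missing Alt Text", "elementSelector": "img.team"},
]
current_issues = [
    {"url": "/about", "issueType": "Missing Alt Text", "elementSelector": "img.team"},
    {"url": "/blog", "issueType": "Duplicate Meta Title"},
]
diff = diff_broken_elements(previous_issues, current_issues)
```

The choice of identity key matters: if the selector changes between audits because the page markup was restructured, the same underlying problem would show up as one resolved issue and one new issue.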
A high-level overview of changes across the entire website:
For each audited URL, the diff will explicitly show changes for each of the 12 SEO checklist points:
https://yourdomain.com/example-page
* Meta Title:
    * Previous: "Old Page Title | Your Brand" (Duplicate with /another-page)
    * Current: "New, Unique Page Title | Your Brand" (Unique)
    * Change: ✓ Resolved Duplication, Content Updated
* Meta Description:
    * Previous: "This is an old, generic description."
    * Current: "A unique and engaging description for this specific page, optimized for search."
    * Change: ✓ Content Updated
* H1 Presence:
    * Previous: ✗ Missing H1
    * Current: ✓ H1 Found: "Welcome to Our Example Page"
    * Change: ✓ H1 Added
* Image Alt Coverage:
    * Previous: 2/5 images missing alt text
    * Current: 0/5 images missing alt text
    * Change: ✓ Alt text added to 2 images
* Internal Link Density:
    * Previous: 5 internal links detected
    * Current: 8 internal links detected
    * Change: ↑ Increased by 3 links
* Canonical Tags:
    * Previous: ✗ Incorrect canonical pointing to /old-page
    * Current: ✓ Correct canonical pointing to /example-page
    * Change: ✓ Canonical Tag Corrected
* Open Graph Tags:
    * Previous: ✗ Missing og:image, og:description
    * Current: ✓ All essential OG tags present
    * Change: ✓ OG Tags Added
* Core Web Vitals:
    * LCP (Largest Contentful Paint):
        * Previous: 3.5s (Needs Improvement)
        * Current: 2.1s (Good)
        * Change: ✓ Improved by 1.4s
    * CLS (Cumulative Layout Shift):
        * Previous: 0.15 (Needs Improvement)
        * Current: 0.02 (Good)
        * Change: ✓ Improved by 0.13
    * FID (First Input Delay):
        * Previous: 150ms (Needs Improvement)
        * Current: 45ms (Good)
        * Change: ✓ Improved by 105ms
* Structured Data Presence:
    * Previous: ✗ No Schema.org detected
    * Current: ✓ Product Schema detected
    * Change: ✓ Structured Data Added
* Mobile Viewport:
    * Previous: ✗ Viewport meta tag missing
    * Current: ✓ Viewport meta tag present
    * Change: ✓ Mobile Viewport Fixed
* Broken Elements (from Gemini analysis):
    * New: [List of newly identified broken elements]
    * Resolved: [List of previously broken elements that are now fixed]
    * Persisting: [List of broken elements that remain unfixed]
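The Good / Needs Improvement / Poor labels attached to Core Web Vitals values follow the publicly documented thresholds: LCP is good up to 2.5 s and poor above 4.0 s, CLS is good up to 0.1 and poor above 0.25, and FID is good up to 100 ms and poor above 300 ms. A minimal sketch of how a value could be rated and a rating transition reported follows; the function names and the returned structure are assumptions for illustration.

```python
# (good_upper_bound, needs_improvement_upper_bound) per metric.
# Units: LCP in seconds, CLS unitless, FID in milliseconds.
THRESHOLDS = {
    "LCP": (2.5, 4.0),
    "CLS": (0.1, 0.25),
    "FID": (100, 300),
}


def rate(metric: str, value: float) -> str:
    """Map a Core Web Vitals value to its rating band."""
    good, needs_improvement = THRESHOLDS[metric]
    if value <= good:
        return "Good"
    if value <= needs_improvement:
        return "Needs Improvement"
    return "Poor"


def rating_change(metric: str, previous: float, current: float) -> dict:
    """Describe how a metric's rating moved between two audits."""
    return {
        "metric": metric,
        "previous": f"{previous} ({rate(metric, previous)})",
        "current": f"{current} ({rate(metric, current)})",
        "delta": round(current - previous, 3),  # negative means faster/more stable
    }


cls_change = rating_change("CLS", 0.15, 0.02)
```

For all three metrics a lower value is better, so a negative `delta` corresponds to an improvement.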
The complete SiteAuditReport, including the detailed diff object, is securely stored in your dedicated MongoDB instance (hive_db). Each report is timestamped, allowing for easy retrieval and historical trend analysis.
This data will be accessible through the PantheraHive dashboard, providing a visual representation of your site's SEO evolution, highlighting key changes, and enabling you to drill down into specific page-level details.
With the diff successfully generated and stored, the system is ready to proceed to Step 3: "Gemini → Fix". This next step will specifically focus on leveraging the AI capabilities of Gemini to generate precise fixes for any new or persisting broken elements identified in this audit.
Example (LCP Optimization): "Optimize the largest contentful paint element (e.g., hero image) by compressing its file size, using modern image formats (WebP/AVIF), and preloading it in the <head> section."
This AI-powered fix generation provides significant advantages:
The "exact fixes" generated by Gemini are a core component of the final SiteAuditReport. They will be presented alongside the identified issues, often in a clear "Recommendations" or "Actionable Fixes" section, providing a comprehensive view of the problem, its current state, and the precise solution. This forms the "after" state in the "before/after diff" stored in MongoDB, enabling clear tracking of improvements.
This crucial step involves the persistent storage of your site's SEO audit results within our secure MongoDB database. The data is meticulously structured into a SiteAuditReport document, enabling comprehensive tracking, historical analysis, and the generation of actionable insights.
Following the exhaustive crawling and analysis by Puppeteer and the generation of precise fixes by Gemini, all collected data is consolidated. This step ensures that every piece of information – from individual page metrics to identified issues and their proposed solutions, along with a critical before/after comparison – is reliably stored. This forms the foundation for ongoing SEO monitoring, performance measurement, and strategic decision-making.
SiteAuditReport
Each audit run generates a new SiteAuditReport document in MongoDB, designed for clarity, depth, and historical comparison. Below is a detailed breakdown of its structure:
* _id (ObjectId): MongoDB's unique identifier for this specific audit report.
* siteId (String): A unique identifier for your website, linking all audit reports to a single domain.
* auditTimestamp (Date): The exact date and time when this audit was completed, crucial for historical tracking.
* triggerType (String):
    * "Automatic": Indicates the audit was initiated by the weekly Sunday 2 AM schedule.
    * "On-Demand": Indicates a manual trigger by a user.
* pagesAudited (Array of Objects): A detailed list of every page visited and audited.
    * url (String): The full URL of the audited page.
    * statusCode (Number): The HTTP status code returned for the page (e.g., 200, 404, 301).
    * seoMetrics (Object): A comprehensive breakdown of the 12-point SEO checklist for this specific page:
        * metaTitle (String): The page's meta title.
        * metaTitleUnique (Boolean): True if unique across the site, false otherwise.
        * metaDescription (String): The page's meta description.
        * metaDescriptionUnique (Boolean): True if unique, false otherwise.
        * h1Present (Boolean): True if an H1 tag is found.
        * h1Content (String): The content of the first H1 tag (if present).
        * imageAltCoverage (Number): Percentage of images with alt text.
        * internalLinkCount (Number): Total number of internal links on the page.
        * canonicalTagPresent (Boolean): True if a canonical tag is found.
        * canonicalTagUrl (String): The URL specified in the canonical tag (if present).
        * openGraphTagsPresent (Boolean): True if essential Open Graph tags are found.
        * lcpScore (Number): Largest Contentful Paint score (ms).
        * clsScore (Number): Cumulative Layout Shift score.
        * fidScore (Number): First Input Delay score (ms).
        * structuredDataPresent (Boolean): True if any structured data (Schema.org) is detected.
        * mobileViewportPresent (Boolean): True if the viewport meta tag is correctly configured for mobile responsiveness.
    * identifiedIssues (Array of Objects): A list of specific SEO issues found on this page.
        * issueType (String): e.g., "Missing H1", "Duplicate Meta Title", "Low LCP".
        * severity (String): e.g., "Critical", "High", "Medium", "Low".
        * details (String): A descriptive explanation of the issue.
        * elementSelector (String, optional): CSS selector to locate the problematic element.
    * geminiFixes (Array of Objects): The exact, actionable fixes generated by Gemini for each identifiedIssue.
        * issueType (String): Matches the issueType from identifiedIssues.
        * fixDescription (String): Human-readable explanation of the fix.
        * codeSnippet (String): The actual code (HTML, CSS, JS, JSON-LD) to implement the fix.
        * targetFile (String, optional): Suggested file or area where the fix should be applied.
* overallSummary (Object): Aggregated statistics for the entire site audit.
    * totalPagesAudited (Number).
    * totalIssuesFound (Number).
    * totalFixesGenerated (Number).
    * averageLCP (Number).
    * averageCLS (Number).
    * averageFID (Number).
    * uniqueMetaTitlesCount (Number).
    * uniqueMetaDescriptionsCount (Number).
    * pagesWithH1 (Number).
    * pagesWithCanonical (Number).
    * pagesWithOpenGraph (Number).
    * pagesWithStructuredData (Number).
* previousAuditId (ObjectId, optional): A reference to the _id of the immediately preceding audit report for the same siteId. This is critical for generating the diff.
* diffReport (Object): A comprehensive comparison between this audit and the previousAuditId.
    * newIssues (Array of Objects): Issues identified in this audit that were not present in the previous one.
    * resolvedIssues (Array of Objects): Issues present in the previous audit that are no longer detected in this one (indicating successful fixes).
    * metricChanges (Array of Objects): Key performance indicator (KPI) changes.
        * metric (String): e.g., "Average LCP", "Image Alt Coverage".
        * beforeValue (Number/String).
        * afterValue (Number/String).
        * change (Number/String): The delta or percentage change.
        * status (String): e.g., "Improved", "Declined", "No Change".
* status (String): The final state of the audit process:
    * "Completed": Audit successfully ran, data stored.
    * "Issues Identified": Completed, with issues found and fixes generated.
    * "Error": Indicates a failure during the audit process.
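The overallSummary object can be derived entirely from the pagesAudited array. The sketch below shows one possible aggregation in Python over plain dictionaries shaped like the fields listed above. It is an illustration under assumptions: the function name is hypothetical, and uniqueMetaTitlesCount and uniqueMetaDescriptionsCount are interpreted here as the number of pages whose uniqueness flag is true, which the document does not specify.

```python
def summarize(pages: list[dict]) -> dict:
    """Build an overallSummary-style dict from a pagesAudited-style list."""

    def average(field: str):
        values = [p["seoMetrics"][field] for p in pages
                  if p["seoMetrics"].get(field) is not None]
        return round(sum(values) / len(values), 3) if values else None

    def count_true(field: str) -> int:
        return sum(1 for p in pages if p["seoMetrics"].get(field))

    return {
        "totalPagesAudited": len(pages),
        "totalIssuesFound": sum(len(p.get("identifiedIssues", [])) for p in pages),
        "totalFixesGenerated": sum(len(p.get("geminiFixes", [])) for p in pages),
        "averageLCP": average("lcpScore"),
        "averageCLS": average("clsScore"),
        "averageFID": average("fidScore"),
        # Assumed interpretation: pages flagged as having a unique title/description.
        "uniqueMetaTitlesCount": count_true("metaTitleUnique"),
        "uniqueMetaDescriptionsCount": count_true("metaDescriptionUnique"),
        "pagesWithH1": count_true("h1Present"),
        "pagesWithCanonical": count_true("canonicalTagPresent"),
        "pagesWithOpenGraph": count_true("openGraphTagsPresent"),
        "pagesWithStructuredData": count_true("structuredDataPresent"),
    }


# Two illustrative pages with made-up values.
pages_audited = [
    {
        "url": "https://yourdomain.com/",
        "seoMetrics": {"metaTitleUnique": True, "metaDescriptionUnique": True,
                       "h1Present": True, "canonicalTagPresent": True,
                       "openGraphTagsPresent": False, "structuredDataPresent": False,
                       "lcpScore": 2100, "clsScore": 0.1, "fidScore": 45},
        "identifiedIssues": [{"issueType": "Missing Open Graph Tags"}],
        "geminiFixes": [{"issueType": "Missing Open Graph Tags"}],
    },
    {
        "url": "https://yourdomain.com/about",
        "seoMetrics": {"metaTitleUnique": False, "metaDescriptionUnique": True,
                       "h1Present": False, "canonicalTagPresent": True,
                       "openGraphTagsPresent": True, "structuredDataPresent": False,
                       "lcpScore": 3500, "clsScore": 0.3, "fidScore": 150},
        "identifiedIssues": [{"issueType": "Missing H1"},
                             {"issueType": "Duplicate Meta Title"}],
        "geminiFixes": [{"issueType": "Missing H1"}],
    },
]
summary = summarize(pages_audited)
```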
The hive_db → upsert operation intelligently handles data persistence:
* The operation uses siteId and auditTimestamp to uniquely identify each audit.
* It looks up the most recent existing SiteAuditReport for the given siteId. If found, its _id is stored in the previousAuditId field of the current report.
* By comparing the report referenced by previousAuditId and the current audit, the diffReport is calculated and populated.
* The complete SiteAuditReport document is then inserted into the site_audit_reports collection in MongoDB.
This mechanism ensures that a complete, traceable history of your site's SEO performance is maintained, always with a clear link to the preceding state for effective comparison.
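The persistence flow for this step (find the latest earlier report for the site, link to it, compute the diff, insert) can be sketched with an in-memory list standing in for the MongoDB collection. Everything here is a simplified assumption: `store_report` is a hypothetical helper, issues are reduced to plain strings, and integer ids replace ObjectIds.

```python
from datetime import datetime, timezone
from itertools import count

_ids = count(1)  # stand-in for MongoDB ObjectId generation


def store_report(collection: list[dict], report: dict) -> dict:
    """Link a new report to its predecessor, attach a diff, and insert it."""
    earlier = [r for r in collection if r["siteId"] == report["siteId"]]
    previous = max(earlier, key=lambda r: r["auditTimestamp"], default=None)

    stored = dict(report, _id=next(_ids))
    stored["previousAuditId"] = previous["_id"] if previous else None
    if previous:
        before, after = set(previous["issues"]), set(report["issues"])
        stored["diffReport"] = {
            "newIssues": sorted(after - before),
            "resolvedIssues": sorted(before - after),
        }
    collection.append(stored)
    return stored


reports: list[dict] = []
first = store_report(reports, {
    "siteId": "yourdomain.com",
    "auditTimestamp": datetime(2024, 1, 7, 2, 0, tzinfo=timezone.utc),
    "issues": ["Missing H1", "Missing viewport"],
})
second = store_report(reports, {
    "siteId": "yourdomain.com",
    "auditTimestamp": datetime(2024, 1, 14, 2, 0, tzinfo=timezone.utc),
    "issues": ["Missing H1", "Duplicate Meta Title"],
})
```

The first report for a site has no predecessor, so it is stored without a diffReport; every later report carries a previousAuditId, which gives a chain that can be walked backwards for trend analysis.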
The diffReport provides immediate insight into the impact of SEO changes, highlighting improvements and regressions.
Upon completion of this step, a comprehensive SiteAuditReport for your website is securely stored in our database. This report contains all audit findings, Gemini-generated fixes, and a detailed comparison against your previous audit. This rich dataset is now ready to be leveraged for advanced reporting and notifications.
The final step, Step 5: Reporting & Notifications, will utilize this stored SiteAuditReport to generate user-friendly reports and send out relevant notifications, ensuring you are promptly informed of your site's SEO status and actionable insights.
hive_db → conditional_update for "Site SEO Auditor" Workflow
This is the final and crucial step in the "Site SEO Auditor" workflow, where all the gathered audit data, AI-generated fixes, and historical comparisons are persistently stored in your dedicated PantheraHive database. This ensures that a comprehensive, actionable, and trackable record of your site's SEO health is maintained.
The conditional_update operation is designed to store the complete SEO audit report, including page-level breakdowns, identified issues, AI-generated fixes, and a "before/after" comparison with the previous audit. This operation intelligently either creates a new SiteAuditReport document or updates an existing one (e.g., in cases of re-processing or partial updates), ensuring data integrity and efficiency.
conditional_update on SiteAuditReports Collection
This step performs an upsert operation on the SiteAuditReports collection within your MongoDB instance.
Collection: SiteAuditReports
siteId and auditId (or auditDate) are used to identify whether a report for the current audit already exists.
* Insert: If no existing report matches the current audit's siteId and auditId, a new SiteAuditReport document is created.
* Update: If a matching report is found, the existing document is updated with the latest and most complete data from the current audit run. This ensures that any interim or partial data is fully enriched.
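The insert-or-update behavior can be sketched with a dictionary keyed on (siteId, auditId) standing in for the collection. This is a simplified model rather than the actual database call, and `conditional_update` here is a hypothetical helper. With the PyMongo driver, the comparable operation would be `update_one(filter, {"$set": report}, upsert=True)` with a filter on the same two fields.

```python
def conditional_update(collection: dict, report: dict) -> str:
    """Insert the report if its (siteId, auditId) is new, else merge into it."""
    key = (report["siteId"], report["auditId"])
    if key in collection:
        # Update: fields in the new report overwrite or extend the stored ones.
        collection[key].update(report)
        return "updated"
    # Insert: store a copy so later changes to `report` do not leak in.
    collection[key] = dict(report)
    return "inserted"


site_audit_reports: dict = {}

# A partial report is written first (for example, before fixes are generated).
outcome_first = conditional_update(site_audit_reports, {
    "siteId": "yourdomain.com",
    "auditId": "audit-001",
    "pagesAudited": 42,
})

# The same audit run is written again with enriched data.
outcome_second = conditional_update(site_audit_reports, {
    "siteId": "yourdomain.com",
    "auditId": "audit-001",
    "overallScore": 87,
})

stored = site_audit_reports[("yourdomain.com", "audit-001")]
```

Because the key is derived from the audit run rather than generated on write, re-processing the same run is idempotent with respect to the number of documents: it enriches the existing record instead of creating a duplicate.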
SiteAuditReport Document
Each SiteAuditReport document is a comprehensive record of your site's SEO health at a specific point in time. It includes the following key fields:
* _id: A unique identifier for this specific audit report (MongoDB ObjectId).
* siteId: A reference to the website that was audited (e.g., yourdomain.com).
* auditId: A unique identifier for this specific audit run, often a UUID or timestamp-based ID.
* auditDate: Timestamp (ISO 8601 format) indicating when the audit was completed.
* triggerType: String indicating how the audit was initiated (scheduled for weekly runs, on_demand for manual triggers).
* overallScore: An aggregated numerical score representing the overall SEO health of the site (e.g., 0-100).
* pagesAudited: An array of objects, each representing a detailed audit for a single page on your site.
    * url: The full URL of the audited page.
    * pageStatus: HTTP status code (e.g., 200, 404).
    * metaTitle: The page's meta title.
    * metaDescription: The page's meta description.
    * h1Content: The content of the main H1 tag.
    * imageAltCoverage: Percentage of images with alt attributes.
    * internalLinkDensity: Number of internal links on the page.
    * canonicalTag: The canonical URL specified (if any).
    * openGraphTags: Object containing parsed Open Graph properties (e.g., og:title, og:image).
    * coreWebVitals: Object containing LCP, CLS, and FID metrics for the page.
    * structuredDataPresent: Boolean indicating if structured data (JSON-LD) was found.
    * mobileViewportMeta: Boolean indicating if the viewport meta tag is correctly configured.
    * issuesFound: An array of strings detailing specific SEO issues identified on this page (e.g., "Missing H1", "Duplicate Meta Title").
    * aiFixesSuggested: An array of objects, generated by Gemini, providing exact code or content fixes for issues on this page.
        * issue: The specific issue description.
        * fixType: (e.g., "HTML", "Content", "Configuration").
        * fixDescription: Human-readable explanation of the fix.
        * codeSnippet (optional): The exact code snippet to implement the fix.
* issuesSummary: An aggregated object summarizing all unique issues found across the entire site, including counts and affected URLs.
* aiFixesConsolidated: A consolidated list of all unique AI-generated fixes across the entire site, grouped by issue type or impact.
* beforeAfterDiff: An object detailing the comparison with the immediately preceding audit report.
    * previousAuditId: Reference to the _id of the previous audit report.
    * overallScoreChange: Numerical difference in the overall SEO score.
    * newIssuesDetected: An array of issues found in the current audit that were not present in the previous one.
    * issuesResolved: An array of issues present in the previous audit that are no longer detected in the current one.
    * metricChanges: An array of objects detailing significant changes in key metrics (e.g., LCP improvement/regression, alt text coverage change).
This feature is critical for understanding the evolution of your site's SEO performance.
The system queries the SiteAuditReports collection to find the most recent audit report for the same siteId.
* Score Changes: Any increase or decrease in the overallScore.
* Issue Resolution: Issues that were present previously but are now absent.
* New Issues: Issues that were not present previously but are now detected.
* Metric Shifts: Significant changes in Core Web Vitals (LCP, CLS, FID) or other quantifiable metrics like image alt coverage or internal link density.
The comparison is stored in the beforeAfterDiff field of the current SiteAuditReport document.
Upon completion of this step, the SiteAuditReport is fully stored in your PantheraHive database. You can access these reports through:
SiteAuditReports, visualize trends, and review the detailed issues and AI-generated fixes.
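The metric shifts recorded in a report's diff could be computed along the following lines. This is a sketch under assumptions: the entry fields mirror the metric, beforeValue, afterValue, change, and status fields described for metricChanges, while `percentChange`, the `LOWER_IS_BETTER` set, and the function name are hypothetical additions for illustration.

```python
# Metrics where a smaller number is the better outcome (assumed list).
LOWER_IS_BETTER = {"Average LCP", "Average CLS", "Average FID"}


def metric_change(metric: str, before: float, after: float) -> dict:
    """Build one metricChanges-style entry for a numeric metric."""
    delta = after - before
    percent = None if before == 0 else round(100.0 * delta / before, 1)
    if delta == 0:
        status = "No Change"
    else:
        improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
        status = "Improved" if improved else "Declined"
    return {
        "metric": metric,
        "beforeValue": before,
        "afterValue": after,
        "change": delta,
        "percentChange": percent,  # hypothetical extra field
        "status": status,
    }


metric_changes = [
    metric_change("Average LCP", 3500, 2100),      # milliseconds, lower is better
    metric_change("Image Alt Coverage", 60, 100),  # percent, higher is better
    metric_change("Average FID", 45, 45),
]
```

Tracking the direction per metric avoids a common mistake in diff reports, where a drop in LCP and a drop in alt-text coverage would otherwise both be labeled the same way.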