Workflow: Site SEO Auditor
This deliverable outlines the successful execution of Step 1: puppeteer → crawl for the "Site SEO Auditor" workflow. In this crucial initial phase, a headless browser, powered by Puppeteer, systematically navigates and extracts comprehensive data from every accessible page on your website. This process simulates a real user's journey, ensuring an accurate and thorough collection of raw data essential for the subsequent 12-point SEO audit.
The output of this step is a meticulously structured dataset containing all discovered URLs and, for each URL, a rich collection of SEO-relevant attributes. This raw data forms the foundational input for the detailed SEO analysis and fix generation in the subsequent steps.
The crawl process is initiated using Google's Puppeteer library, which controls a headless Chromium browser instance. This ensures that the site is accessed and rendered exactly as a typical user's browser would, capturing dynamic content and client-side rendered elements.
The crawler waits for network activity to settle (Puppeteer's networkidle0 or networkidle2 conditions) to ensure all resources (JS, CSS, images) have loaded and the page is fully rendered before data extraction. This is critical for single-page applications (SPAs) or sites with heavy JavaScript.

The crawler employs a comprehensive strategy to discover all accessible pages within your domain.
* Following all <a> tags with href attributes pointing to pages within the same domain.

For every successfully crawled page, the following SEO-critical data points are extracted directly from the fully rendered DOM:
* The text content of the <title> tag. Example: <title>Your Page Title Here</title>
* The content attribute of the <meta name="description"> tag. Example: <meta name="description" content="A concise summary of your page.">
* Presence (boolean) of at least one <h1> tag.
* The text content of the first <h1> tag found. Example: <h1>Main Heading of the Page</h1>
* A count of all <img> tags on the page.
* A count of <img> tags missing the alt attribute. Example of a compliant tag: <img src="image.jpg" alt="Descriptive alt text">
* The total count of internal links (<a> tags pointing to the same domain).
* The text content (anchor text) and href for each internal link.
* The href attribute of the <link rel="canonical"> tag, if present. Example: <link rel="canonical" href="https://www.yourdomain.com/canonical-page/">
* og:title, og:description, og:image, og:url, and og:type properties extracted from <meta property="og:..."> tags. Example: <meta property="og:title" content="Open Graph Title">
* Detection of <script type="application/ld+json"> tags. The full JSON-LD content is extracted for later validation.
* Presence (boolean) of the <meta name="viewport" content="..."> tag, along with its full content attribute value for detailed analysis. Example: <meta name="viewport" content="width=device-width, initial-scale=1">
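The internal-link counts above depend on classifying each href as same-domain or external. A minimal sketch in Node.js (an illustration, not the production crawler code), using the standard WHATWG URL API:

```javascript
// Illustrative sketch (not the production crawler): classify hrefs against the crawl origin.
function isInternalLink(href, siteOrigin) {
  try {
    // Resolve relative hrefs ("/about") against the site; absolute hrefs keep their own origin.
    const link = new URL(href, siteOrigin);
    return link.origin === new URL(siteOrigin).origin;
  } catch {
    return false; // malformed hrefs are treated as non-internal
  }
}

// Tally internal links the way the per-page report does.
function countInternalLinks(hrefs, siteOrigin) {
  const internalLinks = hrefs.filter((h) => isInternalLink(h, siteOrigin));
  return { internalLinkCount: internalLinks.length, internalLinks };
}
```

Note that schemes like mailto: or javascript: resolve to a null origin and are correctly excluded.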
Leveraging Puppeteer's capabilities, Core Web Vitals (CWV) are measured for each page by integrating with Lighthouse. This provides real-world performance metrics as perceived by users.
These metrics are captured under controlled lab conditions, providing a consistent baseline for performance evaluation.
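As an illustration of how such lab metrics can be bucketed, the sketch below applies Google's published LCP and CLS thresholds (2.5 s / 4 s and 0.1 / 0.25); the TBT cut-offs (200 ms / 600 ms) follow common Lighthouse guidance and are an assumption here:

```javascript
// Bucket a lab measurement into a rating given "good" and "needs improvement" cut-offs.
function rateMetric(value, goodMax, needsImprovementMax) {
  if (value <= goodMax) return 'good';
  if (value <= needsImprovementMax) return 'needs-improvement';
  return 'poor';
}

// LCP/CLS cut-offs are Google's published Core Web Vitals thresholds;
// the TBT cut-offs (200 ms / 600 ms) are assumed from common Lighthouse guidance.
function rateCoreWebVitals({ lcpMs, cls, tbtMs }) {
  return {
    lcp: rateMetric(lcpMs, 2500, 4000),
    cls: rateMetric(cls, 0.1, 0.25),
    tbt: rateMetric(tbtMs, 200, 600),
  };
}
```

For the example page below (LCP 1.8 s, CLS 0.01, TBT 150 ms), every metric rates "good".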
The output of this step is a comprehensive, structured JSON object, representing the raw data collected from the entire website. This object is the input for the subsequent SEO audit and analysis.
Example Output Structure (Conceptual):
{
  "crawlTimestamp": "2023-10-27T10:00:00Z",
  "startingUrl": "https://www.yourdomain.com/",
  "crawledPages": [
    {
      "url": "https://www.yourdomain.com/",
      "statusCode": 200,
      "responseTimeMs": 550,
      "metaTitle": "Your Website - Home Page",
      "metaDescription": "Welcome to your website, discover our services.",
      "h1Present": true,
      "h1Content": "Welcome to Our Platform",
      "imageCount": 15,
      "imagesMissingAlt": 2,
      "internalLinkCount": 25,
      "internalLinks": [
        {"href": "https://www.yourdomain.com/about", "anchorText": "About Us"}
        // ... more links
      ],
      "canonicalTag": "https://www.yourdomain.com/",
      "openGraph": {
        "ogTitle": "Your Website - Home Page",
        "ogDescription": "Welcome to your website, discover our services.",
        "ogImage": "https://www.yourdomain.com/og-image.jpg",
        "ogUrl": "https://www.yourdomain.com/"
      },
      "structuredDataPresent": true,
      "structuredDataContent": [
        // Raw JSON-LD content
      ],
      "viewportMetaPresent": true,
      "viewportContent": "width=device-width, initial-scale=1",
      "coreWebVitals": {
        "lcp": "1.8s",
        "cls": 0.01,
        "tbt": "150ms" // Proxy for FID
      }
    },
    {
      "url": "https://www.yourdomain.com/about"
      // ... similar data for the about page
    }
    // ... data for all other crawled pages
  ],
  "uncrawledUrls": [
    // List of URLs that failed to crawl, with error details
  ]
}
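The discovery strategy described earlier can be sketched as a breadth-first crawl loop. This is a simplified model: fetchPage stands in for the real Puppeteer visit (page.goto plus DOM extraction), and all names here are illustrative assumptions:

```javascript
// Sketch of the discovery loop: breadth-first crawl over same-origin links.
// `fetchPage` is a stand-in for the real Puppeteer visit; it must return
// { statusCode, links } for a given URL, or throw on failure.
async function crawlSite(startUrl, fetchPage, maxPages = 500) {
  const origin = new URL(startUrl).origin;
  const visited = new Set();
  const queue = [startUrl];
  const crawledPages = [];
  const uncrawledUrls = [];

  while (queue.length > 0 && crawledPages.length < maxPages) {
    const url = queue.shift();
    const normalized = new URL(url, origin).href.split('#')[0]; // drop fragments
    if (visited.has(normalized)) continue;
    visited.add(normalized);
    try {
      const page = await fetchPage(normalized);
      crawledPages.push({ url: normalized, statusCode: page.statusCode });
      for (const link of page.links) {
        const abs = new URL(link, normalized);
        if (abs.origin === origin) queue.push(abs.href); // same-domain only
      }
    } catch (err) {
      uncrawledUrls.push({ url: normalized, error: String(err) });
    }
  }
  return { crawledPages, uncrawledUrls };
}
```

A production crawler would add politeness delays, robots.txt handling, and concurrency; the loop above only shows the discovery logic.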
The comprehensive dataset generated by this Puppeteer crawl is now ready for the subsequent steps in the "Site SEO Auditor" workflow:
The audit results will be stored in MongoDB as a structured SiteAuditReport. This includes a "before/after" diff for tracking changes over time.

hive_db → diff - Comprehensive Audit Difference Analysis

This crucial step in the Site SEO Auditor workflow provides a granular "before and after" comparison of your website's SEO health. Following the completion of the headless crawl and initial audit, all current findings have been stored in your dedicated hive_db MongoDB instance. We then perform a difference analysis, comparing these latest results against your most recent previous audit report.
This diff operation is designed to transform raw audit data into actionable insights, highlighting changes, improvements, and regressions across your site's SEO landscape.
The primary goal of the diff step is to surface every meaningful change between the current audit and the previous baseline: improvements, regressions, new issues, and resolved issues.
Our system executes the diff operation through the following sub-steps:
* The system retrieves the latest complete SiteAuditReport (generated in Step 1) from your hive_db.
* Concurrently, it fetches the immediately preceding SiteAuditReport from the same database, establishing the baseline for comparison.
* For every URL audited in the current report, the system attempts to find a corresponding URL in the previous report.
* New pages identified in the current audit (not present in the previous one) will be flagged as "newly audited."
* Pages no longer present will be flagged as "removed/redirected."
* For each matching URL, a detailed comparison is performed across all 12 SEO checklist points. This includes both quantitative metrics (e.g., Core Web Vitals scores, internal link count) and qualitative presence checks (e.g., H1 presence, canonical tag validity).
* Each metric is evaluated to determine if there has been an improvement, a regression, a new issue, a resolved issue, or no change.
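The URL-matching sub-step can be sketched as straightforward set arithmetic (illustrative names, not the production implementation):

```javascript
// Sketch: match pages between two audit runs and flag newly audited vs removed URLs.
function matchPages(currentUrls, previousUrls) {
  const prev = new Set(previousUrls);
  const curr = new Set(currentUrls);
  return {
    matched: currentUrls.filter((u) => prev.has(u)),            // compared in detail
    newlyAudited: currentUrls.filter((u) => !prev.has(u)),      // flagged "newly audited"
    removedOrRedirected: previousUrls.filter((u) => !curr.has(u)), // flagged "removed/redirected"
  };
}
```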
The diff process meticulously compares the status and values for each of the following SEO checklist items, per page:
* Meta Title. Diff Check: Has a meta title been added/removed? Has it changed? Is it now unique (or no longer unique) across the site?
* Meta Description. Diff Check: Has a meta description been added/removed? Has it changed? Is it now unique (or no longer unique)?
* H1 Tag. Diff Check: Is an H1 now present/absent? Has the H1 content changed? Are there now multiple H1s where there weren't before?
* Image Alt Attributes. Diff Check: Has the percentage of images with alt attributes improved or declined? Are new images missing alt text?
* Internal Links. Diff Check: Has the number of internal links changed significantly? Are there new broken internal links?
* Canonical Tag. Diff Check: Is a canonical tag now present/absent? Has it changed? Is it now self-referencing and valid (or no longer)?
* Open Graph Tags. Diff Check: Are OG tags now present/absent? Have key OG properties (e.g., og:title, og:image) changed or become invalid?
* Core Web Vitals. Diff Check: Has the Largest Contentful Paint (LCP), Cumulative Layout Shift (CLS), or First Input Delay (FID) score improved or worsened? Are pages now passing/failing the Core Web Vitals assessment?
* Structured Data. Diff Check: Is structured data now present/absent? Has the type or validity of the structured data changed?
* Mobile Viewport. Diff Check: Is the viewport meta tag now correctly configured (or incorrectly configured)?
The results of this diff operation are integrated directly into the SiteAuditReport stored in MongoDB. This creates a rich, historical record that includes:
* New Issue: A problem detected in the current audit that was not present in the previous one.
* Resolved Issue: A problem from the previous audit that is no longer present.
* Improved Metric: A quantitative metric (e.g., LCP score) that has moved towards a more favorable state.
* Regressed Metric: A quantitative metric that has moved towards a less favorable state.
* No Change: The metric's status or value remains the same.
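These five categories map naturally onto two small classifier functions, sketched here as an illustration (function names are assumptions, not the production code):

```javascript
// Boolean presence checks (e.g. "H1 missing") map to issue transitions.
function classifyIssue(wasPresent, isPresent) {
  if (!wasPresent && isPresent) return 'new-issue';
  if (wasPresent && !isPresent) return 'resolved-issue';
  return 'no-change';
}

// Quantitative metrics map to directional changes. `lowerIsBetter` distinguishes
// metrics like LCP (lower is better) from counts like internal links (higher is better).
function classifyMetric(prev, curr, lowerIsBetter = true) {
  if (prev === curr) return 'no-change';
  const improved = lowerIsBetter ? curr < prev : curr > prev;
  return improved ? 'improved-metric' : 'regressed-metric';
}
```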
This comprehensive diff data will be the foundation for the visual reporting and actionable recommendations you receive, allowing you to quickly grasp the most significant changes since the last audit.
gemini → batch_generate - AI-Powered Fix Generation

This phase marks the transition from identifying SEO issues to generating actionable, precise solutions. Leveraging the advanced capabilities of Google's Gemini AI, we don't just point out problems; we provide the exact fixes needed to resolve them, delivered in a structured, ready-to-implement format.
Following the comprehensive site crawl and audit, a detailed list of "broken elements" and SEO deficiencies is compiled. In this crucial step, this raw audit data is fed into the Gemini AI model. Gemini acts as an expert SEO developer, analyzing each identified issue within its page context and generating specific, actionable code snippets, content recommendations, or configuration suggestions to rectify the problem. The "batch_generate" aspect ensures that all identified issues across your entire site are processed efficiently and systematically. For each identified issue, Gemini is provided with:
* The exact URL of the affected page.
* The specific SEO rule that was violated (e.g., "Missing H1 Tag," "Duplicate Meta Description," "Image without Alt Text").
* Relevant surrounding HTML, text content, or performance metrics.
* Contextual information, such as the page's primary content, existing titles, and descriptions.
Gemini then processes each issue by:
* Understanding the Problem: Interpreting the nature of the SEO issue (e.g., why is a meta description duplicate? What is the main topic of a page missing an H1?).
* Analyzing Page Content: Reading and understanding the content of the affected page to ensure fixes are relevant and contextually appropriate. For example, if an alt tag is missing, Gemini attempts to describe the image based on its filename or surrounding text. If an H1 is missing, it suggests one based on the page's title or main body content.
* Applying SEO Best Practices: Leveraging its vast training data on SEO guidelines and web development best practices to formulate optimal solutions.
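As a rough illustration of the filename heuristic mentioned above (a toy stand-in, not the actual Gemini reasoning), a fallback alt-text candidate could be derived like so:

```javascript
// Toy heuristic: derive a fallback alt-text candidate from an image filename
// when no surrounding text is available. Purely illustrative.
function altFromFilename(src) {
  const file = src.split('/').pop() || '';
  const base = file.replace(/\.[a-z0-9]+$/i, ''); // strip the file extension
  const words = base.split(/[-_]+/).filter(Boolean); // split on dashes/underscores
  if (words.length === 0) return '';
  const text = words.join(' ');
  return text.charAt(0).toUpperCase() + text.slice(1);
}
```

The real model additionally weighs page context, so its output is far richer than this string transform.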
Gemini's output is diverse, covering all aspects of the 12-point SEO checklist. Here are examples of the "exact fixes" it generates:
* Meta Titles & Descriptions:
  * Issue: Duplicate or missing meta descriptions, titles that are too long/short.
  * Fix: Generates unique, concise, and keyword-rich <title> and <meta name="description"> tags tailored to the page's content.
* H1 Tags:
  * Issue: Missing H1, multiple H1s, or irrelevant H1 content.
  * Fix: Suggests a single, relevant <h1> tag based on page content, or identifies which existing H1 to prioritize.
* Image Alt Attributes:
  * Issue: Images missing alt attributes.
  * Fix: Provides descriptive alt text for images, considering their context on the page.
* Canonical Tags:
  * Issue: Missing or incorrect <link rel="canonical"> tags.
  * Fix: Generates the correct self-referencing canonical URL or points to the appropriate canonical for duplicate content.
* Open Graph (OG) Tags:
  * Issue: Missing og:title, og:description, og:image, etc.
  * Fix: Creates relevant Open Graph tags to optimize social sharing previews, drawing content from existing meta tags or page content.
* Mobile Viewport:
  * Issue: Missing or improperly configured <meta name="viewport"> tag.
  * Fix: Provides the standard responsive viewport meta tag for optimal mobile rendering.
* Structured Data (JSON-LD):
  * Issue: Missing or incorrect Schema.org markup (e.g., for Articles, Products, FAQs, LocalBusiness).
  * Fix: Generates complete and valid JSON-LD script blocks (<script type="application/ld+json">) populated with data extracted from the page content, ready for direct implementation.
* Core Web Vitals:
  * Issue: Poor Core Web Vitals (LCP, CLS, FID) performance.
  * Fix: While not always direct code, Gemini provides highly specific recommendations, such as:
    * For LCP (Largest Contentful Paint): Suggestions for image optimization (e.g., specifying dimensions, using modern formats, lazy loading), critical CSS inlining, or deferring non-critical JavaScript.
    * For CLS (Cumulative Layout Shift): Recommendations for reserving space for ads/embeds and specifying image/video dimensions.
    * For FID (First Input Delay): Advice on breaking up long JavaScript tasks or optimizing third-party scripts.
* Internal Linking:
  * Issue: Pages with low internal link density or missed opportunities for internal linking.
  * Fix: Suggests specific anchor text and target pages for new internal links, improving site architecture and crawlability.
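The exact prompt the workflow sends to Gemini is not reproduced here; the sketch below is a hypothetical example of how one audit finding might be serialized into a model prompt (all field names are assumptions):

```javascript
// Hypothetical sketch: serialize one audit finding into a prompt for the model.
// The production prompt is not shown in this document; this only illustrates the shape.
function buildFixPrompt(issue) {
  return [
    'You are an expert SEO developer. Generate an exact fix for the issue below.',
    `Page URL: ${issue.pageUrl}`,
    `Violated rule: ${issue.ruleName}`,
    `Context: ${issue.contextHtml}`,
    'Respond with a single HTML snippet or JSON-LD block, ready to implement.',
  ].join('\n');
}
```

Batching then amounts to mapping this function over every finding and submitting the requests together.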
The output from the gemini → batch_generate step is a structured data set, typically in JSON format, designed for easy consumption and implementation. Each identified issue receives a corresponding fix, presented with:
* page_url: The URL of the page where the fix needs to be applied.
* issue_type: A clear description of the original SEO problem (e.g., "MISSING_H1_TAG", "DUPLICATE_META_DESCRIPTION").
* fix_type: Categorization of the fix (e.g., "HTML_UPDATE", "JSON_LD_ADDITION", "OPTIMIZATION_RECOMMENDATION").
* fix_code: The exact HTML snippet, JSON-LD block, or specific configuration instruction to implement.
* fix_description: A human-readable explanation of what the fix does and why it's recommended.
* severity: The priority level of the issue (e.g., "Critical", "High", "Medium", "Low").

Example Output Structure (Partial):
[
  {
    "page_url": "https://yourwebsite.com/blog/article-1",
    "issue_type": "MISSING_META_DESCRIPTION",
    "fix_type": "HTML_UPDATE",
    "fix_code": "<meta name=\"description\" content=\"Discover the latest trends in AI and machine learning with our in-depth analysis and expert insights.\">",
    "fix_description": "Generates a unique and concise meta description based on the article content to improve click-through rates from search results.",
    "severity": "Critical"
  },
  {
    "page_url": "https://yourwebsite.com/products/widget-pro",
    "issue_type": "IMAGE_MISSING_ALT_TEXT",
    "element_selector": "img[src='/images/widget-pro.jpg']",
    "fix_type": "HTML_UPDATE",
    "fix_code": "<img src=\"/images/widget-pro.jpg\" alt=\"Widget Pro: Advanced AI-powered productivity tool\">",
    "fix_description": "Adds descriptive alt text to an image for improved accessibility and SEO image indexing.",
    "severity": "High"
  },
  {
    "page_url": "https://yourwebsite.com/about-us",
    "issue_type": "MISSING_VIEWPORT_TAG",
    "fix_type": "HTML_UPDATE",
    "fix_code": "<meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">",
    "fix_description": "Adds the standard responsive viewport meta tag to ensure proper rendering across all mobile devices.",
    "severity": "Medium"
  }
]
This step transforms raw audit data into immediately actionable tasks. By providing exact, contextually relevant fixes, it drastically reduces the manual effort required for SEO optimization. These generated fixes are then stored in MongoDB as part of the SiteAuditReport, enabling a clear "before/after" comparison and serving as the foundation for tracking improvement over time. This detailed output is designed to be directly consumable by your development team for efficient implementation.
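For development teams consuming this output, a natural first step is ordering fixes by severity. A small sketch (the rank mapping mirrors the severity levels listed above; helper names are illustrative):

```javascript
// Implementation order: highest-severity fixes first. Unknown severities sort last.
const SEVERITY_RANK = { Critical: 0, High: 1, Medium: 2, Low: 3 };

function sortFixesBySeverity(fixes) {
  return [...fixes].sort(
    (a, b) => (SEVERITY_RANK[a.severity] ?? 4) - (SEVERITY_RANK[b.severity] ?? 4)
  );
}
```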
This document outlines the execution of Step 4 of 5 in your "Site SEO Auditor" workflow: hive_db → upsert.
This crucial step is responsible for securely storing all generated SEO audit data, including the comprehensive 12-point checklist results, Gemini-generated fixes, and the insightful before/after diffs, into your dedicated MongoDB database.
hive_db → upsert - Data Persistence and Historical Tracking

This step represents the culmination of the crawling, auditing, and fix generation phases. It ensures that all the valuable insights gathered about your website's SEO performance are permanently stored and made accessible for historical analysis, trend tracking, and future reporting.
The primary purpose of the hive_db → upsert operation is to:
* Persist the complete, structured SiteAuditReport for the current audit run in your MongoDB database.
* Use MongoDB's upsert mechanism to either insert a new audit report document or, in specific scenarios (e.g., re-processing a failed audit run identifier), update an existing one, ensuring data consistency.

The SiteAuditReport Document

A comprehensive SiteAuditReport document is generated and stored for each audit run. This document is designed to be self-contained and provide a complete snapshot of your site's SEO health at the time of the audit.
Each SiteAuditReport document will contain the following key fields:
* auditId (String): A unique identifier for this specific audit run.
* siteUrl (String): The root URL of the website that was audited.
* timestamp (Date): The exact date and time when the audit was completed.
* overallStatus (String): An aggregated status (e.g., "Pass", "Warning", "Critical Issues") based on the audit results.
* overallScore (Number): A calculated numerical score reflecting the overall SEO health of the site for this audit run.
* pagesAuditedCount (Number): The total number of unique pages successfully crawled and audited.
* issuesDetectedCount (Number): The total count of unique SEO issues found across all pages.
* pageReports (Array of Objects): An array where each object represents the detailed audit results for a specific page:
  * pageUrl (String): The URL of the audited page.
* pageStatus (String): Status for this specific page (e.g., "Good", "Needs Attention", "Critical").
* metrics (Object): Detailed results for each of the 12 SEO checklist points:
* metaTitle: { currentValue, status (Pass/Fail), issues (Array of strings), fixSuggestion (String from Gemini) }
* metaDescription: { currentValue, status, issues, fixSuggestion }
* h1Presence: { status, issues, fixSuggestion }
* imageAltCoverage: { status, issues (e.g., list of images missing alt), fixSuggestion }
* internalLinkDensity: { status, count, issues (e.g., broken links), fixSuggestion }
* canonicalTag: { currentValue, status, issues, fixSuggestion }
* openGraphTags: { status, issues (e.g., missing essential tags), fixSuggestion }
* coreWebVitals: { lcpScore, clsScore, fidScore, overallStatus, issues, fixSuggestion }
* structuredData: { status, detectedTypes (Array of strings), issues, fixSuggestion }
* mobileViewport: { status, issues, fixSuggestion }
* (...and other checklist items)
* geminiGeneratedFixes (Array of Strings): A collection of specific, actionable fix suggestions generated by Gemini for this page.
* beforeAfterDiff (Object): A summary of changes compared to the immediately preceding audit run for the same site:
  * previousAuditId (String): The auditId of the previous audit run used for comparison.
* overallScoreChange (Number): The change in overallScore (positive indicates improvement).
* newIssuesDetected (Array of Objects): A list of new issues identified since the last audit.
* issuesResolved (Array of Objects): A list of issues that were present in the previous audit but are now resolved.
* pageLevelChanges (Array of Objects): Summaries of significant changes on individual pages.
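Assembling the beforeAfterDiff can be sketched as set differences over issue keys. Keying issues by page URL plus issue type is an assumption for illustration; the production key may differ:

```javascript
// Sketch: assemble the beforeAfterDiff summary from two SiteAuditReport-like objects.
// Assumes each report carries { auditId, overallScore, issues: [{ pageUrl, issueType }] }.
function buildBeforeAfterDiff(current, previous) {
  const key = (i) => `${i.pageUrl}::${i.issueType}`; // assumed issue identity
  const prevKeys = new Set(previous.issues.map(key));
  const currKeys = new Set(current.issues.map(key));
  return {
    previousAuditId: previous.auditId,
    overallScoreChange: current.overallScore - previous.overallScore, // positive = improvement
    newIssuesDetected: current.issues.filter((i) => !prevKeys.has(key(i))),
    issuesResolved: previous.issues.filter((i) => !currKeys.has(key(i))),
  };
}
```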
The upsert operation in MongoDB is a powerful update command that creates a new document if no document matches the query criteria, or updates the existing document(s) if matches are found.
In the context of the Site SEO Auditor, the workflow for this step is as follows:
* Query hive_db for the most recent previous SiteAuditReport for your siteUrl.
* Compute the beforeAfterDiff object by comparing the current results against that baseline.
* Insert the new SiteAuditReport document (including the calculated diff) into the SiteAuditReports collection in MongoDB.

While the step is named upsert, for historical audit reports the typical behavior is to insert a new document for each run to maintain a full audit trail. An upsert could be used if a specific auditId (e.g., for a re-run of a specific ID) needs to be updated. The key outcome is the persistence of the detailed report.

Storing your SiteAuditReport in MongoDB gives you a durable, queryable audit history that supports historical analysis, trend tracking, and future reporting.
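The upsert call itself can be sketched as follows. The helper below only builds the arguments; with the official Node.js MongoDB driver they would be passed to collection.updateOne(filter, update, options). Keying on auditId is an assumption consistent with the re-run scenario described above:

```javascript
// Sketch: build the arguments for the MongoDB upsert of a SiteAuditReport.
// In production these would be passed to, e.g.:
//   db.collection('SiteAuditReports').updateOne(args.filter, args.update, args.options)
function buildAuditUpsert(report) {
  return {
    filter: { auditId: report.auditId }, // assumed key: a re-run of the same auditId updates in place
    update: { $set: report },
    options: { upsert: true },           // insert if no document matches the filter
  };
}
```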
Upon completion of this step, you will have a fully persisted SiteAuditReport for this run, including per-page results, Gemini-generated fixes, and the before/after diff.
With the audit data successfully stored in MongoDB, the workflow will proceed to the final step (Step 5 of 5): generate_report. This step will leverage the newly stored SiteAuditReport to generate a user-friendly, comprehensive report that highlights key findings, recommendations, and the before/after diff, which will then be delivered to you.
hive_db → conditional_update - Site SEO Auditor

This final step confirms the successful processing, storage, and update of your website's SEO audit report within our secure MongoDB database. This action ensures that all audit findings, performance metrics, identified issues, and recommended fixes are persistently recorded and accessible for tracking your site's SEO health over time.
Status: COMPLETE
The comprehensive SEO audit for your site has been successfully executed, and all generated data has been processed and stored. This includes:
A new SiteAuditReport document has been created or an existing one updated in your dedicated MongoDB collection, reflecting the latest state of your website's SEO profile.
SiteAuditReport Schema & Content Overview

Each SiteAuditReport document in your MongoDB database is structured to provide a holistic and granular view of your site's SEO performance. Key fields include:
* _id: Unique identifier for the audit report.
* auditId: A unique ID for the specific audit run.
* siteUrl: The root URL of the audited website.
* timestamp: Date and time of the audit completion.
* auditType: ("onDemand" or "scheduled").
* overallStatus: ("success", "warnings", "criticalIssues").
* summary:
  * totalPagesCrawled: Number of unique pages audited.
  * issuesFound: Total count of SEO issues across all pages.
  * criticalIssues: Count of high-priority issues (e.g., missing H1, broken canonicals).
  * warnings: Count of medium-priority issues (e.g., missing alt text on minor images).
  * pagesWithIssues: List of URLs with at least one issue.
* pageReports: An array of objects, each representing an audited page:
  * url: The specific URL of the page.
  * status: HTTP status code (e.g., 200, 404).
  * seoChecks: An array of detailed check results for each of the 12 points:
    * checkName: (e.g., "Meta Title Uniqueness", "H1 Presence", "Core Web Vitals - LCP").
    * status: ("pass", "fail", "warning", "notApplicable").
    * details: Specific findings, values, or reasons for pass/fail.
    * issueDescription: Human-readable description of the problem if status is "fail" or "warning".
    * geminiFix: (Optional) The exact fix generated by Gemini for the issue, including code snippets or detailed instructions.
    * beforeState: (Optional) The original problematic code/value before the fix.
    * afterState: (Optional) The recommended corrected code/value.
* beforeAfterDiff: A high-level comparison to the previous audit report:
  * previousAuditId: Reference to the _id of the last audit.
  * issuesResolved: Count of issues fixed since the last audit.
  * newIssuesFound: Count of new issues identified.
  * metricChanges: Key metric changes (e.g., average LCP improvement/decline).
* rawData: (Optional) Raw data output from Puppeteer, Lighthouse, etc., for deep debugging.

The before/after Diff

The integration of a beforeAfterDiff within each SiteAuditReport is a critical feature. It allows you to verify fixes, catch regressions early, and quantify SEO progress between audit runs.
Upon completion of this step, your audit data is immediately available:
* Dashboard: Your SiteAuditReport will be accessible through your dedicated PantheraHive dashboard, offering visual summaries, detailed page-level reports, and the Gemini-generated fixes.
* API: SiteAuditReport data can be accessed directly via the PantheraHive API, allowing for custom integrations and data analysis.

This workflow is designed for continuous SEO monitoring: scheduled re-runs extend the audit history and keep the before/after diffs current.
This step marks the successful completion of the "Site SEO Auditor" workflow. Your website's SEO audit has been fully processed, and the results are securely stored and ready for review.
Next Recommended Actions:
* Review your latest SiteAuditReport.
* Track the beforeAfterDiff to monitor the impact of your SEO efforts in subsequent audit reports.

We are committed to providing you with actionable insights to continuously improve your website's search engine visibility and performance.