v1.0 Effective: 2026-04-06 · Classification: Public

VerityHelm Methodology

VerityHelm provides adversarial verification of compliance attestations using publicly available signals. We produce findings, not opinions. This document discloses the complete methodology used to generate findings reports, enabling any party to understand, reproduce, or challenge our results.

VerityHelm does not perform audits, issue attestations, or provide legal opinions. We cross-reference vendor compliance claims against publicly observable signals and report factual contradictions, gaps, and observations.

1. Data Sources

VerityHelm v1.0 queries the following public signal sources. Each source is documented with its access method, limitations, and reliability assessment.

Tier 1 Sources (API-accessible, high reliability)

Source | What It Provides | Access Method | Freshness
SEC/EDGAR | Public company filings: 10-K risk factors, 8-K cybersecurity incidents, auditor changes | REST API (JSON), free, 10 req/sec | Same-day
GitHub Security Advisories | Open-source vulnerability database with CVEs and severity scores | REST API, CC BY 4.0 licensed | Near-real-time
Certificate Transparency (crt.sh) | All TLS/SSL certificates issued for a domain — subdomain enumeration, infrastructure mapping | Web/JSON/PostgreSQL, free | Near-real-time
USPTO Trademarks | Trademark registration status, filing history, related entities | REST API with free API key, 60 req/min | Daily
PCAOB AuditorSearch | Registered audit firms, engagement history, inspection reports | Bulk CSV download (daily updates), free | Daily
Vendor Trust Pages | SOC 3 reports, ISO 27001 certificates, public compliance claims | Web scrape of vendor websites | Varies (annual cycle)

Tier 2 Sources (requires subscription or careful access)

Source | What It Provides | Access Method | Freshness
NASBA CPAverify | CPA license verification across 55 U.S. jurisdictions | Web interface, individual lookups | Varies by state
AICPA Peer Review | Audit firm quality oversight enrollment and results | Web interface, individual lookups | 1–3 year review cycles
HaveIBeenPwned | Domain breach exposure history | REST API, paid subscription for domain search | As breaches are verified
UKAS/ANAB Directories | ISO 27001 certifier accreditation status | Web search | As accreditations change
Court Records (PACER/RECAP) | Federal litigation history | PACER (paid, $0.10/page) and RECAP (free archive) | Same-day (PACER)

Tier 3 Sources (supplementary, used with caution)

Source | What It Provides | Access Method | Limitations
DNS/Subdomain History | Historical DNS records, infrastructure changes | SecurityTrails API (paid) or DNSdumpster | Limited free tier
State Corporation Filings | Entity registration, good standing, registered agent | Per-state web interfaces (fragmented) | No unified API; bot protection
Job Posting History | Security team maturity, technology stack signals | Wayback Machine CDX API (free) | Indirect signal; coverage gaps

Sources NOT Used

  • Paste sites (Pastebin, etc.): Corroborative only, never primary. We do not download or store credential content. If paste content contains PII, it is skipped entirely.
  • Social media (Twitter/X, LinkedIn posts): Not used as primary signals due to unreliability. May be used to corroborate findings from authoritative sources.
  • Confidential SOC 2 Type II reports: We do not access, request, or process confidential audit reports in v1.0. Engine inputs are limited to public signals (SOC 3 summaries, trust pages, public attestation claims).

2. Collection Method

2.1 Signal Collection

Each data source is queried using deterministic scripts (not AI/LLM agents). The collection process:

  1. Vendor identification: Vendor name → domain resolution (common patterns: vendor.com, vendor.io, etc.)
  2. Parallel signal collection: Each source is queried independently with the vendor's domain or name as input
  3. Structured output: Each source produces a JSON intermediate file with source name, URL, query timestamp (UTC), raw data retrieved, and error state
  4. Rate limit compliance: All queries respect published rate limits and robots.txt directives
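The four steps above can be sketched as a small collection wrapper. This is an illustrative sketch, not the production schema: the function names, the candidate-domain patterns, and the intermediate-file field names are assumptions chosen to mirror the description (source name, URL, UTC query timestamp, raw data, error state).

```python
from datetime import datetime, timezone

def candidate_domains(vendor_name):
    """Step 1: vendor name -> candidate domains (common patterns)."""
    slug = vendor_name.lower().replace(" ", "")
    return [f"{slug}.com", f"{slug}.io"]

def collect_signal(source_name, url, fetch):
    """Steps 2-3: query one source and wrap the result in the JSON
    intermediate format: source, URL, UTC timestamp, data, error."""
    record = {
        "source": source_name,
        "url": url,
        "queried_at": datetime.now(timezone.utc).isoformat(),
        "data": None,
        "error": None,
    }
    try:
        record["data"] = fetch(url)
    except Exception as exc:
        # Error state is recorded, not raised: a failed source must not
        # abort collection from the other sources (see Section 2.3).
        record["error"] = str(exc)
    return record
```

Each source produces one such record, so downstream steps can distinguish "source returned nothing" from "source was unavailable."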

2.2 Rate Limits Observed

Source | Published Limit | Our Policy
SEC/EDGAR | 10 req/sec | Max 5 req/sec (50% of limit)
GitHub Advisories | 60 req/hr (unauthenticated) | Max 30 req/hr
crt.sh | Fair use | Max 1 req per 5 sec
USPTO | 60 req/min | Max 30 req/min
CourtListener | Published limits | Respect published limits
All others | Fair use | Minimum 2 sec between requests
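A minimum-interval limiter is enough to enforce every policy in the table above (e.g. 0.2 s between requests for the 5 req/sec SEC/EDGAR policy, 5 s for crt.sh). The sketch below is illustrative; the class name and injectable clock are assumptions made so the behavior is testable without real sleeps.

```python
import time

class MinIntervalLimiter:
    """Block until at least `min_interval_s` has elapsed since the
    previous request to the same source."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # monotonic time of the previous request

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

One limiter instance per source keeps per-source pacing independent, so a slow source never throttles the others.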

2.3 Error Handling

If a source is unavailable:

  • The finding report notes the source as "unavailable at query time"
  • No findings are generated from unavailable sources
  • The overall analysis continues with available sources
  • The "Signal Freshness" section of the report reflects which sources returned data
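The unavailable-source policy can be expressed as a single filtering pass over the collected records. The field names below follow the illustrative intermediate format from Section 2.1 and are assumptions, not a published schema.

```python
def freshness_summary(records):
    """Split collected source records into those eligible to generate
    findings and those noted as unavailable at query time."""
    available = [r["source"] for r in records if r.get("error") is None]
    unavailable = [f'{r["source"]} (unavailable at query time)'
                   for r in records if r.get("error") is not None]
    return {"available": available, "unavailable": unavailable}
```

Only sources in the `available` list feed the cross-reference stage; the full summary is reproduced in the report's Signal Freshness section.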

3. Claim Extraction

3.1 Sources of Claims

Vendor compliance claims are extracted from:

  1. Trust pages: Vendor-operated security/trust/compliance pages
  2. SOC 3 reports: Publicly distributed audit summaries
  3. Company websites: Security sections, compliance sections
  4. Third-party trust centers: Vanta, SafeBase, Whistic, Drata hosted trust portals

3.2 Extraction Method

Claims are extracted using deterministic pattern matching:

  1. Certification identification: Regex patterns match certifications (SOC 2, ISO 27001, HIPAA, GDPR, PCI DSS, FedRAMP, CSA STAR, CCPA, SOX)
  2. Audit firm identification: Regex patterns match known audit firm names and common phrases ("audited by," "examined by," "certified by")
  3. Security claims: Pattern matching extracts statements about encryption, monitoring, testing, and other security practices
  4. No AI interpretation at this step: All extraction is regex-based. The patterns are versioned with this methodology document.
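The extraction steps above can be sketched with a small subset of patterns. These regexes are simplified illustrations, not the versioned production pattern set: the certification list is abridged and the auditor pattern handles only the common "audited/examined/certified by" phrasings.

```python
import re

# Abridged certification patterns (illustrative subset).
CERT_PATTERN = re.compile(
    r"\b(SOC 2(?: Type (?:I|II))?|ISO 27001|HIPAA|GDPR|PCI DSS|"
    r"FedRAMP|CSA STAR|CCPA|SOX)\b",
    re.IGNORECASE,
)

# Audit firm name following a known attribution phrase, up to
# sentence-ending punctuation.
AUDITOR_PATTERN = re.compile(
    r"\b(?:audited|examined|certified) by\s+([A-Z][\w&.,' -]{2,60}?)(?=[.;\n]|$)"
)

def extract_claims(text):
    """Deterministic, regex-only claim extraction (no AI interpretation)."""
    certs = sorted({m.group(1) for m in CERT_PATTERN.finditer(text)})
    auditors = [m.group(1).strip() for m in AUDITOR_PATTERN.finditer(text)]
    return {"certifications": certs, "auditors": auditors}
```

Because the patterns are pure regex, the same trust-page text always yields the same extracted claims, which is what makes the pipeline reproducible.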

3.3 What We Do NOT Extract

  • We do not extract claims from confidential SOC 2 Type II reports
  • We do not extract claims from NDA-gated trust portals (we only access publicly visible content)
  • We do not infer claims that are not explicitly stated on vendor pages

4. Cross-Reference Logic

4.1 Signal-to-Claim Matching

Cross-referencing uses deterministic rules that compare public signals against extracted claims:

Claim Type | Signal Source | Cross-Reference Rule
"Zero security incidents" | GitHub Advisories, HaveIBeenPwned, SEC 8-K | If high/critical advisories or breach records exist during the claimed audit period, flag as contradiction
"Continuous monitoring" | CT Logs (subdomain count) | If subdomain count exceeds 50, flag as gap — question whether monitoring covers all infrastructure
Certification claims | AICPA Peer Review, CPAverify | Verify audit firm enrollment in peer review program; verify CPA signatory license status
ISO 27001 claims | UKAS/ANAB Directories | Verify certifying body is accredited

4.2 Rule Types

  1. Contradiction: A public signal directly contradicts a vendor claim. Example: vendor claims zero incidents, but HaveIBeenPwned shows their domain in a breach database during the audit period.
  2. Gap: A public signal raises a question that the vendor claim does not address. Example: 200 subdomains discovered but monitoring claims don't specify scope.
  3. Observation: A public signal is notable but does not directly contradict or gap a specific claim. Example: SEC 8-K filings exist that may contain cybersecurity incident disclosures.
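The three rule outcomes can be sketched as small deterministic classifiers. The thresholds and signal shapes mirror the examples above (e.g. the 50-subdomain gap threshold); the function names are illustrative and the real rule set is larger.

```python
def classify_zero_incident_claim(breach_records_in_period):
    """'Zero security incidents' vs. breach/advisory records in the
    claimed audit period: any record is a direct contradiction."""
    return "contradiction" if breach_records_in_period else None

def classify_monitoring_claim(subdomain_count, threshold=50):
    """'Continuous monitoring' vs. CT-log footprint: a large footprint
    raises a question but does not directly contradict the claim."""
    return "gap" if subdomain_count > threshold else None

def classify_8k_filings(has_8k_filings):
    """SEC 8-K filings are notable but tied to no specific claim, so
    they are reported as observations."""
    return "observation" if has_8k_filings else None
```

A `None` return means no finding is generated, which is how the pipeline stays biased toward false negatives rather than false positives (Section 5.4).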

4.3 Temporal Matching

All cross-references are time-aware:

  • Signals are matched to the vendor's most recent audit period (if identifiable from SOC 3 or trust page)
  • If the audit period is not identifiable, signals from the most recent 12 months are used
  • The report notes when temporal alignment could not be verified
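The temporal rule above reduces to a window selection plus a containment check. The tuple format and the `verified` flag (which feeds the report's note about unverifiable temporal alignment) are assumptions for illustration.

```python
from datetime import date, timedelta

def matching_window(audit_period, today):
    """Return (start, end, verified): the audit period when it is
    identifiable, otherwise the trailing 12 months, unverified."""
    if audit_period is not None:
        start, end = audit_period
        return start, end, True
    return today - timedelta(days=365), today, False

def signal_in_window(signal_date, window):
    start, end, _verified = window
    return start <= signal_date <= end
```

Signals outside the window are excluded from contradiction and gap rules rather than reported with a caveat, again favoring false negatives.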

5. Contradiction Detection

5.1 What Constitutes a Contradiction

A finding is classified as a contradiction when ALL of the following are true:

  1. The vendor makes a specific, verifiable claim (e.g., "zero security incidents in the audit period")
  2. A public signal from an authoritative source directly conflicts with that claim (e.g., HIBP shows a breach record for the vendor's domain during the same period)
  3. The conflict is unambiguous — there is no reasonable interpretation that reconciles both the claim and the signal

5.2 What Constitutes a Gap

A finding is classified as a gap when:

  1. The vendor makes a broad claim (e.g., "continuous monitoring")
  2. Public signals suggest the claim may be incomplete but do not directly contradict it (e.g., large infrastructure footprint that may exceed monitoring coverage)

5.3 What Does NOT Constitute a Finding

  • A vendor not having a trust page (absence of evidence is not evidence of absence)
  • A vendor using a compliance automation platform (this is standard practice)
  • A vendor's audit firm not being in our "known" list (we flag for investigation, not as a finding)
  • Public signals that are ambiguous or could have multiple interpretations

5.4 False Positive Expectations

VerityHelm v1.0 is calibrated for a low false positive rate at the cost of a higher false negative rate. We prefer to miss findings rather than report incorrect ones. Expected rates:

  • False positive rate: <5% of reported findings
  • False negative rate: ~40–60% of actual issues (many compliance issues are not detectable from public signals)

6. Known Limitations

6.1 Coverage Limitations

  • Private companies: Limited SEC/EDGAR data. Analysis primarily relies on trust pages, CT logs, GitHub, and court records.
  • Non-US companies: NASBA CPAverify, PACER, and state filings are US-only. International coverage requires different signal sources.
  • Small/early-stage companies: May have minimal public signal footprint. Analysis may return few or no findings.
  • Vendors without trust pages: If no public compliance claims are found, cross-referencing is not possible.

6.2 Methodology Limitations

  • No access to confidential reports: SOC 2 Type II reports are not used in v1.0. This means we cannot verify specific control descriptions or test procedures.
  • Deterministic pattern matching: Regex-based claim extraction may miss non-standard phrasings. Complex or nuanced claims may not be extracted.
  • Temporal alignment: Audit period dates are not always publicly available, limiting precision of temporal cross-referencing.
  • Auditor quality assessment is indirect: We can verify peer review enrollment and CPA license status, but we cannot assess the quality of the audit work itself from public signals.

6.3 Categories of Vendors Poorly Served

  1. Private companies with minimal web presence
  2. Companies operating primarily outside the US
  3. Companies that do not publish any compliance information publicly
  4. Infrastructure-level vendors (IaaS, PaaS) whose compliance posture is documented in separate compliance portals with different URL patterns

6.4 What the Methodology Cannot Detect

  • Fabricated evidence within confidential audit reports (requires access to the report)
  • Auditor capture or independence issues (requires insight into auditor-client relationship economics)
  • Internal compliance program effectiveness (requires internal access)
  • Social engineering susceptibility (requires active testing, which we do not perform)
  • Accuracy of specific technical controls (requires technical assessment, which we do not perform)

7. Version History

Version | Date | Changes | Backward Compatible
v1.0 | 2026-04-06 | Initial release. 14 public signal sources. Deterministic pipeline. No scoring — findings only. | N/A (initial)

Planned for v1.1

  • Additional signal sources (paste-site corroborative signals, WHOIS history)
  • Improved temporal matching with audit period extraction from SOC 3 PDFs
  • Expanded audit firm database

Planned for v2.0 (post-legal review)

  • Optional Defensibility Score (0–100, weighted composite of findings)
  • SOC 2 Type II report metadata ingestion (control descriptions, audit firm, dates — not full report)
  • Continuous monitoring mode (weekly signal refresh)

8. Pipeline Architecture

INPUT: Vendor Name
│
├─ Step 1: Vendor Profile Assembly
│    └─ Queries: SEC/EDGAR, CT Logs, GitHub Advisories,
│       PCAOB, USPTO, Wayback Machine, CourtListener
│    └─ Output: 01-vendor-profile.json
│
├─ Step 2: Claim Extraction
│    └─ Scans: Trust pages, third-party trust centers,
│       SOC 3 download URLs
│    └─ Extracts: Certifications, audit firm, security claims
│    └─ Output: 02-claims.json
│
├─ Step 3: Cross-Reference Analysis
│    └─ Matches: Public signals against extracted claims
│    └─ Classifies: Contradiction / Gap / Observation
│    └─ Output: 03-cross-references.json
│
├─ Step 4: Fraud Pattern Match
│    └─ Checks: Auditor legitimacy, certification speed,
│       infrastructure scope, breach history
│    └─ Output: 04-pattern-matches.json
│
└─ Step 5: Report Generation
     └─ Assembles: All findings into structured report
     └─ Includes: Subject, methodology disclosure, findings,
        questions, signal freshness, summary
     └─ Output: findings-report.md

NOTE: Steps 1–4 are fully deterministic (scripts, regex, API queries). Step 5 assembles structured data into report format. LLM interpretation is available as an optional enhancement in future versions.
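The five-step chain can be sketched as a minimal driver in which each step receives all prior outputs and contributes one named intermediate. The step order and artifact names come from the pipeline description; the driver function itself and the `write` hook are illustrative assumptions.

```python
def run_pipeline(vendor_name, steps, write=None):
    """Chain deterministic steps: `steps` is a list of
    (artifact_name, step_fn) pairs, each step_fn mapping the dict of
    prior outputs to one new intermediate artifact."""
    outputs = {"vendor": vendor_name}
    for artifact_name, step_fn in steps:
        outputs[artifact_name] = step_fn(outputs)
        if write is not None:
            write(artifact_name, outputs[artifact_name])  # persist JSON/report
    return outputs
```

Because every step sees the accumulated outputs, the report-generation step can assemble the profile, claims, cross-references, and pattern matches without re-querying any source.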

This methodology document is versioned and published at verityhelm.com/methodology. Any changes result in a version increment documented in the Version History section. Findings reports reference the specific methodology version under which they were produced.