Decision DEC-0027

Online-Source Ingest Hardening and Retry Strategy

human
Status: accepted

Decision

Treat the online-source audit as a discovery success but an ingest-conversion bottleneck.

The next online-source tranche should prioritize:

  1. hardening the ingest pipeline against the concrete blocker classes exposed by wave 1
  2. retrying the blocked wave 1 books before broadening to a larger wave 2
  3. expecting new essays and deeper outputs only from books that ingest successfully and cross the activation threshold into kg-relevant

Context

Perplexity's audit in PR-0146 found a substantial free-text opportunity set, and the operationalization pass converted that into a real ingest backlog.

But the first execution tranche in PR-0150 showed that discovery coverage and ingest readiness are not the same thing:

  • 107 free texts were identified in the audit
  • 99 books were normalized into ingest candidates
  • 20 books were selected for wave 1
  • only 4 of those 20 were successfully ingested
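
The funnel above can be expressed as stage-to-stage conversion rates. This is a minimal sketch using only the four counts listed; the stage labels and the `stage_rates` helper are illustrative names, not part of the repo. Note that the step from 99 candidates to 20 selected books is a deliberate selection, not a failure, so only the last rate measures ingest conversion.

```python
# Stage counts copied from the PR-0150 figures listed above.
FUNNEL = [
    ("free texts identified", 107),
    ("normalized ingest candidates", 99),
    ("selected for wave 1", 20),
    ("successfully ingested", 4),
]


def stage_rates(funnel):
    """Return (stage, count, fraction of the previous stage) for each later stage."""
    rows = []
    for (_, prev_count), (name, count) in zip(funnel, funnel[1:]):
        rows.append((name, count, count / prev_count))
    return rows


for name, count, rate in stage_rates(FUNNEL):
    print(f"{name}: {count} ({rate:.1%} of previous stage)")
```

The final row is the one this decision targets: 4 of 20 selected books ingested, which leaves 16 blocked.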

The main failures were not editorial. They were adapter and source-shape problems:

  • Sacred Texts request blocking
  • malformed or truncated normalized URLs
  • Internet Archive viewer/detail pages mistaken for text sources
  • partial chapter traversal on Rudolf Steiner Archive
  • missing PDF extraction support
  • transient network failures

Two of the four successful ingests became kg-relevant and correctly produced full essay pipelines. The other two remained searchable-internal, which is why they produced activation briefs but not essays.

Strategy

First priority: adapter hardening

The pipeline should handle the source shapes it has already encountered rather than assuming that more research alone will resolve them.

The immediate hardening targets are:

  • canonical Internet Archive text/download resolution
  • Sacred Texts fetch behavior or deterministic fallback-source selection
  • Rudolf Steiner Archive full-book traversal from index pages
  • PDF extraction for public-domain sources where HTML is unavailable
  • URL normalization and validation before a candidate reaches execution
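
Two of these targets, URL validation before execution and Internet Archive viewer-page detection, can be sketched as a preflight check. This is a hedged illustration, not the repo's adapter code: `preflight`, `ia_metadata_url`, and the problem labels are hypothetical names. The sketch assumes Internet Archive items live under `/details/<identifier>` or `/stream/<identifier>` and that item file listings are available from `https://archive.org/metadata/<identifier>`; both assumptions should be checked against Internet Archive's own documentation before relying on them.

```python
from urllib.parse import urlsplit

IA_HOSTS = ("archive.org", "www.archive.org")
IA_VIEWER_PREFIXES = ("details", "stream")


def _ia_identifier(parts):
    """Return the item identifier for an Internet Archive viewer URL, else None."""
    host = (parts.hostname or "").lower()
    segments = [s for s in parts.path.split("/") if s]
    if host in IA_HOSTS and len(segments) >= 2 and segments[0] in IA_VIEWER_PREFIXES:
        return segments[1]
    return None


def preflight(url):
    """Return a list of problem labels; an empty list means the URL may proceed."""
    problems = []
    text = url.strip()
    if any(ch.isspace() for ch in text):
        problems.append("whitespace")
    if text.endswith("...") or text.endswith("\u2026"):
        problems.append("truncated")
    try:
        parts = urlsplit(text)
    except ValueError:
        return problems + ["unparseable"]
    if parts.scheme not in ("http", "https"):
        problems.append("bad-scheme")
    if not parts.hostname or "." not in parts.hostname:
        problems.append("bad-host")
    if _ia_identifier(parts) is not None:
        # A viewer/detail page is HTML chrome around the item, not a text source.
        problems.append("ia-viewer-page")
    return problems


def ia_metadata_url(url):
    """Map a viewer URL to the (assumed) metadata endpoint that lists item files."""
    identifier = _ia_identifier(urlsplit(url.strip()))
    if identifier is None:
        return None
    return f"https://archive.org/metadata/{identifier}"
```

A candidate with a non-empty problem list would be held back from execution; one flagged `ia-viewer-page` could be routed through `ia_metadata_url` so an adapter can pick a plain-text or PDF file from the item instead of scraping the viewer.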

Second priority: bounded retry

Retry the blocked wave 1 set after the pipeline is improved.

Do not jump straight to wave 2 until the system converts a higher fraction of known-good candidates than the 4 of 20 achieved in wave 1. Otherwise the backlog will grow faster than it can be ingested.
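
A bounded retry with a gate on the next wave might look like the sketch below. Everything here is hypothetical: `TransientIngestError`, the `ingest` callable, `max_attempts`, and `min_rate` are illustrative, and this record does not set a numeric conversion target. The sketch retries only failures marked transient, caps attempts per book, and does not add backoff delays.

```python
class TransientIngestError(Exception):
    """Hypothetical marker for retryable failures such as network timeouts."""


def retry_blocked(book_ids, ingest, max_attempts=3):
    """Try each blocked book at most max_attempts times.

    ingest(book_id) is an assumed callable: it returns normally on success,
    raises TransientIngestError for retryable failures, and raises any other
    exception for failures that a retry will not fix.
    """
    resolved, still_blocked = [], {}
    for book_id in book_ids:
        for _attempt in range(max_attempts):
            try:
                ingest(book_id)
            except TransientIngestError as exc:
                still_blocked[book_id] = f"transient: {exc}"
                continue  # worth another attempt, up to the cap
            except Exception as exc:
                still_blocked[book_id] = f"permanent: {exc}"
                break  # retrying will not help
            else:
                still_blocked.pop(book_id, None)
                resolved.append(book_id)
                break
    return resolved, still_blocked


def may_open_next_wave(resolved_count, attempted_count, min_rate):
    """Gate wave 2 on the retry conversion rate; min_rate is a team-chosen value."""
    if attempted_count <= 0:
        raise ValueError("attempted_count must be positive")
    return resolved_count / attempted_count >= min_rate
```

Keeping the cap and the gate as explicit parameters makes the retry measurable: the outcome is a resolved list, a still-blocked map with causes, and a single yes/no answer about broadening.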

Third priority: output expectations

Do not assume that every sourced text should immediately yield essays.

The expected output ladder is:

  • sourced candidate
  • successful corpus ingest
  • activation level assignment
  • downstream prompts only when the activation level justifies them

Essays should continue to emerge from kg-relevant or more active sources, not from every public-domain text found on the web.
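
The ladder can be read as a gate on essay generation. The sketch below is an assumption-laden illustration: only the level names `searchable-internal` and `kg-relevant` appear in this record, so the ordering list, the threshold constant, and the `essays_expected` function are hypothetical, and any more active levels would need to be appended after `kg-relevant`.

```python
# Ordered from least to most active. Only these two names come from this
# record; more active levels, if they exist, would be appended at the end.
ACTIVATION_ORDER = ["searchable-internal", "kg-relevant"]
ESSAY_THRESHOLD = "kg-relevant"


def essays_expected(ingested, activation_level):
    """Return True only for ingested sources at or above the essay threshold."""
    if not ingested:
        # A sourced candidate that never ingested has no activation level yet.
        return False
    if activation_level not in ACTIVATION_ORDER:
        raise ValueError(f"unknown activation level: {activation_level!r}")
    rank = ACTIVATION_ORDER.index(activation_level)
    return rank >= ACTIVATION_ORDER.index(ESSAY_THRESHOLD)
```

Under this reading, the two wave 1 books that stayed searchable-internal correctly produced no essays, and a blocked book produces nothing downstream regardless of how promising the source looked in the audit.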

Consequences

Positive

  • The repo stops confusing "found online" with "operationally ingested."
  • Retry work becomes measurable and bounded instead of diffuse.
  • Future essay expectations become structurally honest.

Tradeoffs

  • Wave 2 breadth is deferred in favor of reliability.
  • Some books may still require manual source replacement or research backfill.
  • The ingest toolchain becomes more capable and therefore somewhat more complex.

Implementation Priority

  1. patch the source adapters and preflight validation
  2. retry the 16 blocked wave 1 books
  3. record a retry report with resolved vs. still-blocked causes
  4. only then generate the next ingest wave
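
The retry report in step 3 could be tallied per blocker class, as in this sketch. The class labels are shorthand invented here for the six failure types listed under Context, and `retry_report` and its input shape are hypothetical; the actual report format is not specified in this record.

```python
from collections import Counter

# Shorthand labels for the six blocker classes named in the Context section.
BLOCKER_CLASSES = (
    "sacred-texts-request-blocking",
    "malformed-url",
    "ia-viewer-page",
    "steiner-partial-traversal",
    "missing-pdf-extraction",
    "transient-network",
)


def retry_report(results):
    """Summarize retry outcomes.

    results is an iterable of (book_id, blocker_class, resolved) tuples, where
    blocker_class is the cause recorded in wave 1 and resolved is a bool.
    """
    resolved, still_blocked = Counter(), Counter()
    for book_id, cause, ok in results:
        if cause not in BLOCKER_CLASSES:
            raise ValueError(f"{book_id}: unknown blocker class {cause!r}")
        (resolved if ok else still_blocked)[cause] += 1
    attempted = sum(resolved.values()) + sum(still_blocked.values())
    return {
        "attempted": attempted,
        "resolved": dict(resolved),
        "still_blocked": dict(still_blocked),
    }
```

Grouping by cause shows which hardening targets paid off and which still need manual source replacement, which is the input step 4 needs before a new wave is generated.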