DEC-0027: Online-Source Ingest Hardening and Retry Strategy

Decision

Treat the online-source audit as a discovery success that exposed an ingest-conversion bottleneck.

The next online-source tranche should prioritize:

hardening the ingest pipeline against the concrete blocker classes exposed by wave 1
retrying the blocked wave 1 books before broadening to a larger wave 2
expecting new essays and deeper outputs only from books that ingest successfully and cross the activation threshold into kg-relevant

Context

Perplexity's audit in PR-0146 found a substantial free-text opportunity set, and the operationalization pass converted that into a real ingest backlog.

But the first execution tranche in PR-0150 showed that discovery coverage and ingest readiness are not the same thing:

107 free texts were identified in the audit
99 books were normalized into ingest candidates
20 books were selected for wave 1
only 4 of those 20 (20%) were successfully ingested, leaving 16 blocked
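
The funnel arithmetic behind these figures can be made explicit. This is a minimal sketch: the counts come from the audit and wave 1 figures above, while the stage labels and the `stage_rates` helper are illustrative names, not part of the repo.

```python
# Funnel counts taken from the audit and wave 1 figures in this record.
# Stage labels and the helper name are illustrative assumptions.
FUNNEL = [
    ("identified in audit", 107),
    ("normalized candidates", 99),
    ("selected for wave 1", 20),
    ("successfully ingested", 4),
]


def stage_rates(funnel):
    """Return (stage, count, fraction of the previous stage) for each stage after the first."""
    rates = []
    for (_, prev_count), (name, count) in zip(funnel, funnel[1:]):
        rates.append((name, count, count / prev_count))
    return rates


if __name__ == "__main__":
    for name, count, rate in stage_rates(FUNNEL):
        print(f"{name}: {count} ({rate:.1%} of previous stage)")
```

The last stage is the one this decision targets: 4 of the 20 selected books ingested, which leaves 16 in the blocked set.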

The main failures were not editorial. They were adapter and source-shape problems:

Sacred Texts request blocking
malformed or truncated normalized URLs
Internet Archive viewer/detail pages mistaken for text sources
partial chapter traversal on Rudolf Steiner Archive
missing PDF extraction support
transient network failures

Two of the four successful ingests became kg-relevant and correctly produced full essay pipelines. The other two remained searchable-internal, which is why they produced activation briefs but not essays.

Strategy

First priority: adapter hardening

The pipeline should learn to handle the source shapes it has already encountered rather than assuming that more research alone will resolve the failures.

The immediate hardening targets are:

canonical Internet Archive text/download resolution
Sacred Texts fetch behavior or deterministic fallback-source selection
Rudolf Steiner Archive full-book traversal from index pages
PDF extraction for public-domain sources where HTML is unavailable
URL normalization and validation before a candidate reaches execution
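
A preflight check covering two of these targets (URL validation, and Internet Archive viewer/detail pages mistaken for text sources) might look like the sketch below. The function name, the rule set, and the reason strings are assumptions for illustration, and the example URLs use made-up identifiers. The only outside fact relied on is that archive.org serves item landing pages under `/details/` and the book viewer under `/stream/`, with files under `/download/`.

```python
from urllib.parse import urlsplit

# Hypothetical preflight check; the rules and reason strings are assumptions.
# archive.org path prefixes that serve a landing page or viewer rather than text.
IA_NON_TEXT_PREFIXES = ("/details/", "/stream/")


def preflight(url: str) -> list:
    """Return the problems found in a candidate URL; an empty list means it passed."""
    candidate = url.strip()
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return ["unparseable URL"]

    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append("missing or unsupported scheme")
    host = parts.hostname or ""
    if "." not in host:
        problems.append("missing or malformed host")
    # A trailing ellipsis usually means the URL was cut off during normalization.
    if candidate.endswith(("...", "…")):
        problems.append("URL looks truncated")
    if host == "archive.org" or host.endswith(".archive.org"):
        if parts.path.startswith(IA_NON_TEXT_PREFIXES):
            problems.append("Internet Archive viewer/detail page, not a text source")
    return problems


if __name__ == "__main__":
    for u in (
        "https://archive.org/details/somebook",
        "https://archive.org/download/somebook/somebook_djvu.txt",
        "https://example.org/texts/boo...",
    ):
        print(u, "->", preflight(u) or "ok")
```

Running a check like this before a candidate reaches execution turns a late ingest failure into an early, labeled rejection that can be routed to source replacement.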

Second priority: bounded retry

Retry the blocked wave 1 set after the pipeline is improved.

Do not start wave 2 until the system can convert a higher fraction of known-good candidates. Otherwise the backlog will grow faster than it can be converted into ingested texts.
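
The gate on wave 2 can be expressed as a single comparison. This is a sketch only: this decision does not fix a target fraction, so `min_fraction` and the 0.5 value used below are placeholders, and `wave2_allowed` is a hypothetical name.

```python
def wave2_allowed(ingested: int, attempted: int, min_fraction: float) -> bool:
    """True once the conversion fraction on known-good candidates reaches min_fraction."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return ingested / attempted >= min_fraction


if __name__ == "__main__":
    # Wave 1 converted 4 of 20 books. The 0.5 bar is a placeholder, not a figure
    # from this decision.
    print(wave2_allowed(4, 20, min_fraction=0.5))  # False
```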

Third priority: output expectations

Do not assume that every sourced text should immediately yield essays.

The expected output ladder is:

sourced candidate
successful corpus ingest
activation level assignment
downstream prompts only when the activation level justifies them

Essays should continue to emerge from kg-relevant or more active sources, not from every public-domain text found on the web.
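
The ladder can be sketched as a small decision function. Only the two level names below appear in this record; the numeric ordering, the `Activation` enum, and `downstream_outputs` are assumptions for illustration. Whether kg-relevant sources also receive an activation brief is not stated here, so the sketch lists only what wave 1 showed for each level.

```python
from enum import IntEnum
from typing import List, Optional


class Activation(IntEnum):
    # Level names come from this record; the ordering is an assumption, and
    # more active levels may exist above KG_RELEVANT.
    SEARCHABLE_INTERNAL = 1
    KG_RELEVANT = 2


def downstream_outputs(ingested: bool, level: Optional[Activation]) -> List[str]:
    """Map a candidate's position on the ladder to the outputs it may trigger."""
    if not ingested or level is None:
        # A sourced candidate that has not ingested or has no level yields nothing.
        return []
    if level >= Activation.KG_RELEVANT:
        return ["essay pipeline"]
    return ["activation brief"]


if __name__ == "__main__":
    print(downstream_outputs(True, Activation.SEARCHABLE_INTERNAL))
    print(downstream_outputs(True, Activation.KG_RELEVANT))
    print(downstream_outputs(False, None))
```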

Consequences

Positive

The repo stops confusing "found online" with "operationally ingested."
Retry work becomes measurable and bounded instead of diffuse.
Future essay expectations become structurally honest.

Tradeoffs

Wave 2 breadth is deferred in favor of reliability.
Some books may still require manual source replacement or research backfill.
The ingest toolchain becomes more capable and therefore somewhat more complex.

Implementation Priority

patch the source adapters and preflight validation
retry the 16 blocked wave 1 books
record a retry report with resolved vs. still-blocked causes
only then generate the next ingest wave
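
One possible shape for the retry report is sketched below. The `RetryResult` fields, the `retry_report` helper, and the sample rows are hypothetical; the rows are placeholders to show the structure, not results from any actual retry.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RetryResult:
    book: str
    wave1_cause: str                        # blocker class recorded in wave 1
    still_blocked_by: Optional[str] = None  # None means the retry ingested


def retry_report(results: List[RetryResult]) -> Dict[str, object]:
    """Tally resolved causes and still-blocked causes across a retry run."""
    resolved = Counter(r.wave1_cause for r in results if r.still_blocked_by is None)
    blocked = Counter(
        r.still_blocked_by for r in results if r.still_blocked_by is not None
    )
    return {
        "attempted": len(results),
        "resolved": dict(resolved),
        "still_blocked": dict(blocked),
    }


if __name__ == "__main__":
    # Placeholder rows only; book names and outcomes are invented for the example.
    sample = [
        RetryResult("book-a", "missing PDF extraction support"),
        RetryResult("book-b", "Sacred Texts request blocking",
                    still_blocked_by="Sacred Texts request blocking"),
    ]
    print(retry_report(sample))
```

Keeping the wave 1 cause separate from the current blocker lets the report show both which hardening work paid off and which books need manual source replacement.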