# Claude Code Prompt: Notion Webhook Reliability Implementation

## Context

Notion paused webhook delivery to our endpoint because it could not confirm successful delivery for an 8-hour window. The endpoint is:

```
https://openclaws-mac-mini.tail123813.ts.net/webhook/notion
```

The local event-router is healthy. The confirmed gap is that we have no per-request acknowledgment logging — we can see events being processed, but we cannot prove every inbound Notion POST received a 2xx response. Notion pauses delivery after repeated unacknowledged requests.

This prompt implements three targeted fixes:
1. Per-request RECV/ACK logging on the webhook endpoint
2. A proactive external health check cron job
3. A Notion thread entry documenting the incident and changes

Do not restart any services or modify routing/delivery logic. Scope is strictly logging and monitoring additions.

---

## Pre-Flight: Discovery Phase

Before writing any code, run the following and capture all output. Every subsequent step depends on this evidence.

```bash
# 1. Locate the router process and source
ps aux | grep -E "(router|webhook|event)" | grep -v grep

# 2. Find the router source directory
find /Users/openclaw -name "*.js" -path "*/router*" 2>/dev/null | head -20
find /Users/openclaw -name "*.ts" -path "*/router*" 2>/dev/null | head -20
find /Users/openclaw -name "*.js" -path "*/webhook*" 2>/dev/null | head -20

# 3. Check if router is dist/compiled or editable source
ls -la /Users/openclaw/.openclaw/
ls -la /Users/openclaw/.openclaw/workspace/ 2>/dev/null
find /Users/openclaw -name "package.json" -not -path "*/node_modules/*" 2>/dev/null | head -10

# 4. Locate all log files
ls -lht /Users/openclaw/.openclaw/logs/ 2>/dev/null
find /Users/openclaw -name "router*.log" 2>/dev/null
find /Users/openclaw -name "*.log" -not -path "*/node_modules/*" 2>/dev/null | head -20

# 5. Check current log format — does it include HTTP response codes?
tail -50 /Users/openclaw/.openclaw/logs/router.log 2>/dev/null || \
  find /Users/openclaw -name "router.log" -exec tail -50 {} \;

# 6. Check existing cron jobs
openclaw cron list

# 7. Check router health endpoint
curl -sf http://127.0.0.1:8080/health

# 8. Check Tailscale status
tailscale status 2>/dev/null || echo "tailscale not in PATH"

# 9. Fetch the Notion thread for this incident (Threads DB)
# Threads DB ID: d75c3749-1459-4225-a645-d48671dae765
```

**Stop here. Report all findings before proceeding.**

---

## Step 1 — Find the Failure Window

Using the logs located in the discovery phase, find the 8-hour failure window Notion reported.

```bash
# Find the gap in Notion webhook traffic
# Adjust log path based on discovery results
LOG_PATH="/Users/openclaw/.openclaw/logs/router.log"

# Last 200 notion-related entries with timestamps
grep -i "notion\|webhook" "$LOG_PATH" | tail -200

# Look for any non-2xx or error markers
grep -iE "(error|4[0-9]{2}|5[0-9]{2}|timeout|fail|refused)" "$LOG_PATH" | tail -50

# Find timestamp of last entry before the gap
# Notion's email said "past 8 hours" — search backward from now
grep -i "notion" "$LOG_PATH" | \
  awk '{print $1}' | sort | uniq | tail -30
```

**Document:**
- Timestamp of last successful Notion event before the pause
- Timestamp of first Notion event after the resume
- Whether the log contains HTTP response codes (2xx/4xx/5xx) or only event processing records
- Whether there is a gap in inbound traffic or just a gap in forwarded events

---

## Step 2 — Add Per-Request RECV/ACK Logging

**Only proceed if the router source is editable (not a compiled dist file).**

If the router is a dist/compiled file, skip to Step 2-B.

### Step 2-A: Source-Based Patch

Locate the webhook route handler. It will be the function that receives POST requests at `/webhook/notion`. Add logging middleware immediately before it.

The middleware must log two structured JSON lines per request:

**On receipt:**
```json
{
  "level": "RECV",
  "source": "notion",
  "requestId": "<x-notion-signature or timestamp>",
  "method": "POST",
  "path": "/webhook/notion",
  "ts": "<ISO timestamp>"
}
```

**On response send:**
```json
{
  "level": "ACK",
  "source": "notion",
  "requestId": "<same as RECV>",
  "status": 200,
  "elapsed": 45,
  "ts": "<ISO timestamp>"
}
```

**Node.js/Express pattern:**
```javascript
app.use('/webhook/notion', (req, res, next) => {
  const requestId = req.headers['x-notion-signature'] ||
                    req.headers['x-request-id'] ||
                    `req-${Date.now()}`;
  const start = Date.now();

  // Skip logging for internal health probes
  if (req.headers['x-health-probe'] === 'true') {
    return next();
  }

  console.log(JSON.stringify({
    level: 'RECV',
    source: 'notion',
    requestId,
    method: req.method,
    path: req.path,
    ts: new Date().toISOString()
  }));

  const originalSend = res.send.bind(res);
  res.send = function(body) {
    console.log(JSON.stringify({
      level: 'ACK',
      source: 'notion',
      requestId,
      status: res.statusCode,
      elapsed: Date.now() - start,
      ts: new Date().toISOString()
    }));
    return originalSend(body);
  };

  next();
});
```

After patching, verify the syntax is valid:
```bash
node --check <router-file-path>
```

**Do not restart the router yet. Flag for Braden to confirm restart.**

### Step 2-B: If Router is Compiled/Dist

If the router source is not directly editable, document:
- The exact file path of the compiled router
- The OpenClaw version
- Whether OpenClaw has a middleware hook or plugin system for adding request interceptors

Do not attempt to patch a compiled file. Flag this as a blocker requiring a different approach.

---

## Step 3 — Add External Health Check Cron

This cron job probes the actual public Tailscale URL — not localhost — to detect reachability failures before Notion does.

```bash
# First verify the endpoint responds from localhost via the public URL path
curl -sf -o /dev/null -w "%{http_code}" \
  -X POST https://openclaws-mac-mini.tail123813.ts.net/webhook/notion \
  -H "Content-Type: application/json" \
  -H "x-health-probe: true" \
  -d '{"type":"health-check","source":"internal-probe"}'
```

If the above returns 200 or 400 (router received it), the endpoint is reachable. If it times out or returns 000, that is likely the root cause of Notion's failure.

**Create the cron job:**

```bash
# Check available cron syntax first
openclaw cron --help

# Create health check cron (every 10 minutes)
openclaw cron add \
  --name "notion-webhook-health" \
  --schedule "*/10 * * * *" \
  --model "openai-codex/gpt-5.4-mini" \
  --command 'bash -c '\''
    STATUS=$(curl -sf -o /dev/null -w "%{http_code}" \
      --max-time 10 \
      -X POST https://openclaws-mac-mini.tail123813.ts.net/webhook/notion \
      -H "Content-Type: application/json" \
      -H "x-health-probe: true" \
      -d "{\"type\":\"health-check\"}" 2>/dev/null)
    if [[ "$STATUS" != 2* ]]; then
      echo "NOTION_WEBHOOK_HEALTH_FAIL status=$STATUS ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        >> /Users/openclaw/.openclaw/logs/webhook-health.log
    else
      echo "NOTION_WEBHOOK_HEALTH_OK status=$STATUS ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        >> /Users/openclaw/.openclaw/logs/webhook-health.log
    fi
  '\'''
```

**If `openclaw cron add` does not support `--command` with shell scripts**, fall back to a LaunchAgent:

```bash
# Write health check script
cat > /Users/openclaw/.openclaw/scripts/notion-webhook-health.sh << 'EOF'
#!/bin/bash
LOG="/Users/openclaw/.openclaw/logs/webhook-health.log"
STATUS=$(curl -sf -o /dev/null -w "%{http_code}" \
  --max-time 10 \
  -X POST https://openclaws-mac-mini.tail123813.ts.net/webhook/notion \
  -H "Content-Type: application/json" \
  -H "x-health-probe: true" \
  -d '{"type":"health-check"}' 2>/dev/null)

TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)
if [[ "$STATUS" == 2* ]]; then
  echo "NOTION_WEBHOOK_HEALTH_OK status=$STATUS ts=$TS" >> "$LOG"
else
  echo "NOTION_WEBHOOK_HEALTH_FAIL status=$STATUS ts=$TS" >> "$LOG"
  # Notify via OpenClaw if available
  openclaw notify --channel telegram \
    --message "ALERT: Notion webhook endpoint returned $STATUS at $TS. Check before Notion pauses delivery." \
    2>/dev/null || true
fi
EOF

chmod +x /Users/openclaw/.openclaw/scripts/notion-webhook-health.sh

# Test it manually first
/Users/openclaw/.openclaw/scripts/notion-webhook-health.sh
cat /Users/openclaw/.openclaw/logs/webhook-health.log
```

Verify the cron was created:
```bash
openclaw cron list
```

---

## Step 4 — Verify Router Always Returns 2xx to Notion

Regardless of whether the router decides to forward/ignore/suppress the event downstream, it must return 2xx to Notion on every valid inbound POST. This is Notion's only signal that delivery succeeded.

Verify this in the router code:

```bash
# Search for the notion webhook handler return/response logic
grep -n "res.send\|res.json\|res.status\|statusCode\|sendStatus" <router-file-path>

# Specifically look for any path where no response is sent
# (missing return, unhandled promise rejection, uncaught exception before send)
grep -n -A 5 "webhook/notion\|/notion" <router-file-path> | head -60
```

**Flag any code path where:**
- An exception could be thrown before `res.send()` is called
- An async function could reject without a catch that sends a response
- The router returns non-2xx for events it decides to ignore

If found, document the specific line and the fix required. Do not patch without flagging first.

---

## Step 5 — Notion Thread Update

Post a session-orientation notice to the Notion Threads database to log this incident.

Use the Notion MCP to append to the relevant thread, or create a new thread entry if none exists for this incident.

**Thread entry content:**

```
Author: Claude | Date: 2026-04-07 | Time: [current time] | TZ: America/Los_Angeles

INCIDENT: Notion webhook delivery paused
Endpoint: https://openclaws-mac-mini.tail123813.ts.net/webhook/notion
Reported: 8-hour delivery failure window ending ~2026-04-05

Root cause diagnosis:
- Router was healthy throughout
- No per-request ACK logging existed — could not prove every POST received 2xx
- Exact failure window not yet confirmed (Step 1 findings required)

Changes implemented:
- [ ] Per-request RECV/ACK logging added to router (Step 2)
- [ ] External webhook health check cron added, 10-min interval (Step 3)
- [ ] Router 2xx guarantee verified (Step 4)

Status: Implementation complete pending Braden review
Decision needed: Confirm router restart to activate logging changes
Next step: Monitor webhook-health.log after Notion resume; confirm no further pauses
```

---

## Deliverable Format

Return findings as a structured report:

### 1. Discovery Summary
- Router source location (editable vs compiled)
- Log format (includes HTTP status or event-processing only)
- Failure window timestamps (if recoverable from logs)
- External URL reachability result from Step 3 probe

### 2. Changes Made
For each step: what was done, exact file paths modified, commands run

### 3. Changes NOT Made (and why)
Anything blocked, skipped, or requiring Braden decision

### 4. Verification Results
- Health check cron confirmed in `openclaw cron list`
- Manual health probe result (HTTP status)
- Router syntax check result (if source was patched)

### 5. Open Items for Braden
- Router restart confirmation (if source was patched)
- Any blocker from Step 2-B (compiled router)
- Any code paths found in Step 4 that need a fix

---

## Constraints

- Do not restart the router or any services without explicit confirmation
- Do not modify routing/delivery/filter logic — logging only
- Do not modify OpenClaw config values during this implementation
- Do not expose any new ports or services
- If any step is ambiguous, stop and report before proceeding
- All file modifications must be reported with exact path and diff
