Urgent Alert System

Applied Materials - Internal Tool

Problem

Critical lab alerts were often missed or acted on too late. When equipment issues or test failures occurred, the information existed in logs and monitoring systems, but it did not reliably reach the right people at the right time. This led to:

Wasted lab time on tests that should have been stopped
Delayed response to equipment issues
Unclear ownership of alert triage

Approach

Built an automated monitoring and escalation service.

Key Components

Event monitoring: Python service polling alert sources on a schedule
Filtering rules: Configurable severity thresholds and affected-device criteria
Escalation logic: Time-based escalation if no acknowledgment
Notification integration: Slack messages for immediate alerts, email summaries for review
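
The filtering and escalation components above can be sketched roughly as follows. This is a minimal illustration, not the production code: the severity scale, the `Alert` and `Rule` shapes, the device names, and the recipient names are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical severity ordering; the real scale is not documented here.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class Alert:
    device: str
    severity: str
    raised_at: datetime
    acked_at: Optional[datetime] = None


@dataclass
class Rule:
    min_severity: str
    devices: frozenset       # empty set means "any device"
    escalation_steps: tuple  # ((delay, recipient), ...) sorted by delay


def matches(alert: Alert, rule: Rule) -> bool:
    """Apply the severity threshold and the affected-device criteria."""
    severe_enough = SEVERITY_ORDER[alert.severity] >= SEVERITY_ORDER[rule.min_severity]
    device_ok = not rule.devices or alert.device in rule.devices
    return severe_enough and device_ok


def current_recipient(alert: Alert, rule: Rule, now: datetime) -> Optional[str]:
    """Return who should be notified at `now`, or None if filtered out or acked."""
    if alert.acked_at is not None or not matches(alert, rule):
        return None
    elapsed = now - alert.raised_at
    recipient = None
    # Walk the steps in order; the last step whose delay has passed wins.
    for delay, who in rule.escalation_steps:
        if elapsed >= delay:
            recipient = who
    return recipient


# Example: notify on-call immediately, escalate to the lab lead after 15 minutes.
rule = Rule(
    min_severity="warning",
    devices=frozenset({"chamber-a"}),
    escalation_steps=((timedelta(0), "on-call"), (timedelta(minutes=15), "lab-lead")),
)
raised = datetime(2024, 1, 1, 9, 0)
alert = Alert(device="chamber-a", severity="critical", raised_at=raised)
```

A polling pass would call `current_recipient` for each open alert and send the Slack message to whoever it returns. Acknowledging the alert (setting `acked_at`) stops further escalation.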

Key Decisions

Pull vs. push: Chose polling over webhooks because the existing alert sources did not support push notifications, which made polling the more reliable option
Rule configurability: Stored rules in config files rather than hardcoding them, allowing quick adjustments without a redeployment
Acknowledgment tracking: Added a simple acknowledgment (ack) mechanism so that acknowledged alerts stop triggering repeat notifications, reducing alert fatigue
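
As a rough sketch of how these decisions fit together, the example below loads rules from JSON, tracks acknowledgments, and runs one polling pass. The config schema, the `AckTracker` class, and the `poll_once` helper are hypothetical names chosen for illustration. The real service's file layout and interfaces may differ.

```python
import json

# Hypothetical rule file contents; the actual schema is an assumption.
EXAMPLE_RULES_JSON = """
{
  "rules": [
    {"name": "chamber-critical", "min_severity": "critical",
     "devices": ["chamber-a", "chamber-b"]},
    {"name": "any-warning", "min_severity": "warning", "devices": []}
  ]
}
"""

REQUIRED_KEYS = {"name", "min_severity", "devices"}


def load_rules(text: str) -> list:
    """Parse rules from JSON text and fail early on malformed entries."""
    rules = json.loads(text)["rules"]
    for rule in rules:
        missing = REQUIRED_KEYS - set(rule)
        if missing:
            raise ValueError(f"rule is missing keys: {sorted(missing)}")
    return rules


class AckTracker:
    """Remembers acknowledged alert IDs so they are not re-sent."""

    def __init__(self):
        self._acked = set()

    def acknowledge(self, alert_id: str) -> None:
        self._acked.add(alert_id)

    def should_notify(self, alert_id: str) -> bool:
        return alert_id not in self._acked


def poll_once(fetch_alerts, tracker: AckTracker, notify) -> int:
    """One polling pass: pull current alerts and notify for unacked ones.

    `fetch_alerts` and `notify` are injected callables, so the pass can be
    run from cron against real sources or exercised with stubs.
    """
    sent = 0
    for alert in fetch_alerts():
        if tracker.should_notify(alert["id"]):
            notify(alert)
            sent += 1
    return sent
```

Because the rules live in a file, changing a threshold only requires editing the JSON before the next scheduled run. Because the pass pulls state each time, an unacknowledged alert is re-sent on every run until someone acks it.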

Outcome

Faster response: Critical issues surfaced within minutes instead of hours
Reduced wasted resources: Early detection allowed stopping problematic tests before consuming lab time
Clear ownership: Escalation paths made it obvious who should act on each alert type
Adoption: The system became a standard part of the lab operations workflow

Technologies

Python, Scheduled Tasks (cron), Slack API, Email Integration, JSON Configuration

Media

System architecture diagram coming soon
