Urgent Alert System
2023-06-01
automation, data, systems
Applied Materials - Internal Tool
Problem
Critical lab alerts were often missed or acted on too late. When equipment issues or test failures occurred, the information was already in logs and monitoring systems, but it did not reliably reach the right people at the right time. This led to:
- Wasted lab time on tests that should have been stopped
- Delayed response to equipment issues
- Unclear ownership of alert triage
Approach
Built an automated monitoring and escalation service:
Key Components
- Event monitoring: Python service polling alert sources on a schedule
- Filtering rules: Configurable severity thresholds and affected-device criteria
- Escalation logic: Time-based escalation if no acknowledgment
- Notification integration: Slack messages for immediate alerts, email summaries for review
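The filtering and escalation components can be sketched roughly as follows. This is a minimal illustration, not the production code: the Alert fields, the severity names, the 15-minute escalation step, and the fetch_alerts callable are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, Optional

# Hypothetical severity ordering; the real system's levels are not documented here.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class Alert:
    alert_id: str
    device: str
    severity: str
    raised_at: datetime
    acked_at: Optional[datetime] = None


def passes_filter(alert: Alert, min_severity: str, devices: set) -> bool:
    """Keep alerts at or above the threshold that affect a watched device."""
    return (
        SEVERITY_RANK[alert.severity] >= SEVERITY_RANK[min_severity]
        and alert.device in devices
    )


def escalation_level(alert: Alert, now: datetime,
                     step: timedelta = timedelta(minutes=15),
                     max_level: int = 2) -> Optional[int]:
    """Return None once acknowledged; otherwise rise one level per elapsed step."""
    if alert.acked_at is not None:
        return None
    elapsed = now - alert.raised_at
    return min(int(elapsed / step), max_level)


def poll_once(fetch_alerts: Callable[[], Iterable[Alert]], now: datetime,
              min_severity: str, devices: set) -> list:
    """One polling pass: fetch, filter, and pair each open alert with its level."""
    results = []
    for alert in fetch_alerts():
        if not passes_filter(alert, min_severity, devices):
            continue
        level = escalation_level(alert, now)
        if level is not None:
            results.append((alert.alert_id, level))
    return results
```

In this sketch a scheduler such as cron would call poll_once on each run, and the returned level would select who gets notified (for example, level 0 for the on-shift owner and higher levels for a lead).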
Key Decisions
- Pull vs. push: Chose polling over webhooks because the existing alert sources did not support push notifications, which made polling the more reliable option
- Rule configurability: Stored rules in config files rather than hardcoding them, so thresholds could be adjusted quickly without a redeployment
- Acknowledgment tracking: Added a simple acknowledgment (ack) mechanism so that acknowledged alerts stop generating repeat notifications, which reduces alert fatigue
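To illustrate the last two decisions, here is a small sketch of rules loaded from JSON and an in-memory ack tracker that suppresses repeat notifications. The rule keys (min_severity, devices, renotify_minutes), the device names, and the AckTracker class are hypothetical; the actual rule schema and ack storage are not described in this write-up.

```python
import json

# Hypothetical rule file contents; the real schema is not shown in this write-up.
EXAMPLE_RULES_JSON = """
{
  "min_severity": "warning",
  "devices": ["chamber-a", "chamber-b"],
  "renotify_minutes": 30
}
"""

REQUIRED_KEYS = {"min_severity", "devices", "renotify_minutes"}


def load_rules(text: str) -> dict:
    """Parse rules from JSON text and fail early if a required key is missing."""
    rules = json.loads(text)
    missing = REQUIRED_KEYS - rules.keys()
    if missing:
        raise ValueError(f"missing rule keys: {sorted(missing)}")
    return rules


class AckTracker:
    """Remembers acknowledged alerts and when each alert was last sent."""

    def __init__(self, renotify_minutes: int):
        self.renotify_seconds = renotify_minutes * 60
        self.acked = set()
        self.last_sent = {}

    def acknowledge(self, alert_id: str) -> None:
        self.acked.add(alert_id)

    def should_notify(self, alert_id: str, now_seconds: float) -> bool:
        """Send if the alert is unacknowledged and the re-notify window has passed."""
        if alert_id in self.acked:
            return False
        last = self.last_sent.get(alert_id)
        if last is not None and now_seconds - last < self.renotify_seconds:
            return False
        self.last_sent[alert_id] = now_seconds
        return True
```

Because the rules are re-read from a file, changing a threshold or the watched-device list in this sketch only requires editing the JSON, which matches the goal of adjusting behavior without redeploying the service.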
Outcome
- Faster response: Critical issues surfaced within minutes instead of hours
- Reduced wasted resources: Early detection allowed stopping problematic tests before consuming lab time
- Clear ownership: Escalation paths made it obvious who should act on each alert type
- Adoption: The system became a standard part of the lab operations workflow
Technologies
Python, Scheduled Tasks (cron), Slack API, Email Integration, JSON Configuration
Media
System architecture diagram coming soon