Urgent Alert System

2023-06-01
automation, data, systems

Applied Materials - Internal Tool

Problem

Critical lab alerts were often missed or acted on too late. When equipment issues or test failures occurred, the information existed in logs and monitoring systems, but it did not reliably reach the right people in time. This led to:

  • Wasted lab time on tests that should have been stopped
  • Delayed response to equipment issues
  • Unclear ownership of alert triage

Approach

Built an automated monitoring and escalation service:

Key Components

  • Event monitoring: Python service polling alert sources on a schedule
  • Filtering rules: Configurable severity thresholds and affected-device criteria
  • Escalation logic: Time-based escalation if no acknowledgment
  • Notification integration: Slack messages for immediate alerts, email summaries for review
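
As an illustration of how the filtering and escalation pieces could fit together, here is a minimal sketch. It is not the production code: the severity names, the `Alert` and `Rule` shapes, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical severity ordering; the real levels are not given in this write-up.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class Alert:
    alert_id: str
    device: str
    severity: str
    raised_at: datetime
    acked_at: Optional[datetime] = None


@dataclass
class Rule:
    min_severity: str
    devices: frozenset  # an empty set means "any device"
    escalate_after: timedelta


def matches(rule: Rule, alert: Alert) -> bool:
    """Return True if the alert meets the severity threshold and device criteria."""
    severe_enough = SEVERITY_RANK[alert.severity] >= SEVERITY_RANK[rule.min_severity]
    device_ok = not rule.devices or alert.device in rule.devices
    return severe_enough and device_ok


def escalation_level(rule: Rule, alert: Alert, now: datetime) -> int:
    """Count the escalation windows that have elapsed without an acknowledgment."""
    if alert.acked_at is not None:
        return 0  # acknowledged alerts never escalate
    return int((now - alert.raised_at) / rule.escalate_after)
```

A scheduled poll would run `matches` on each new alert and, on later runs, use `escalation_level` to decide whether to notify the next person in the escalation path.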

Key Decisions

  • Pull vs Push: Chose polling over webhooks because the existing alert sources did not support push notifications, which made polling the more reliable option
  • Rule configurability: Stored rules in config files rather than hardcoding them, allowing quick adjustments without a redeployment
  • Acknowledgment tracking: Added a simple ack mechanism so acknowledged alerts stop generating repeat notifications, reducing alert fatigue
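
The config-file and acknowledgment decisions can be sketched roughly as follows. The JSON field names, the `load_rules` helper, and the `AckTracker` class are hypothetical, chosen only to make the idea concrete; the actual rule schema is not documented here.

```python
import json

# Hypothetical rule file contents; a real service would read this from disk on each poll
# so that edits take effect without a redeployment.
EXAMPLE_CONFIG = """
{
  "rules": [
    {
      "name": "chamber-faults",
      "min_severity": "critical",
      "devices": ["chamber-3"],
      "escalate_after_minutes": 15
    }
  ]
}
"""

REQUIRED_KEYS = {"name", "min_severity", "devices", "escalate_after_minutes"}


def load_rules(text: str) -> list:
    """Parse rule definitions from JSON text, failing early on missing keys."""
    rules = json.loads(text)["rules"]
    for rule in rules:
        missing = REQUIRED_KEYS - set(rule)
        if missing:
            raise ValueError(f"rule {rule.get('name', '?')} is missing {sorted(missing)}")
    return rules


class AckTracker:
    """Remembers acknowledged alert IDs so later polls do not re-notify."""

    def __init__(self) -> None:
        self._acked = set()

    def acknowledge(self, alert_id: str) -> None:
        self._acked.add(alert_id)

    def should_notify(self, alert_id: str) -> bool:
        return alert_id not in self._acked
```

This tracker keeps state in memory only, so a restart would forget acknowledgments; persisting the set to a file or database would be the natural next step.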

Outcome

  • Faster response: Critical issues surfaced within minutes instead of hours
  • Reduced wasted resources: Early detection allowed stopping problematic tests before consuming lab time
  • Clear ownership: Escalation paths made it obvious who should act on each alert type
  • Adoption: System became a standard part of the lab operations workflow

Technologies

Python, Scheduled Tasks (cron), Slack API, Email Integration, JSON Configuration

Media

System architecture diagram coming soon