Skip to content

Conversation

@suvodeep-pyne
Copy link
Contributor

@suvodeep-pyne suvodeep-pyne commented Jan 27, 2026

Summary

  • Add TenantTagReadinessCallback to verify broker has valid tenant tags (not just broker_untagged)
  • Add RoutingReadinessCallback to verify routing exists for all online tables before reporting healthy
  • Register both callbacks in BaseBrokerStarter.registerServiceStatusHandler() with documented ordering dependency
  • Use ExternalView (actual state) instead of IdealState (desired state) in RoutingReadinessCallback, following the Helix layer separation principle

Background

This addresses a production issue where queries could be routed to brokers that were not yet ready to serve traffic, causing query failures. The root cause is that the existing health check only validates Helix state convergence but doesn't verify:

  1. The broker has valid tenant tags assigned
  2. Routing tables are built for assigned tables

Test plan

  • Unit tests for TenantTagReadinessCallback (4 tests)
  • Unit tests for RoutingReadinessCallback (6 tests including ExternalView API verification)
  • Integration test in HelixBrokerStarterTest

This change ensures that broker health checks take tenant tags and routing
readiness into account before reporting the broker as healthy.

Two new ServiceStatusCallback implementations are added:
- TenantTagReadinessCallback: Verifies the broker has valid tenant tags
  (not just broker_untagged)
- RoutingReadinessCallback: Verifies routing exists for all assigned tables

This addresses a production issue where queries could be routed to brokers
that were not yet ready to serve traffic, causing query failures.

The callbacks are registered before the IdealState/ExternalView match
callbacks, with TenantTagReadinessCallback preceding RoutingReadinessCallback
since an untagged broker has no tables assigned.
@suvodeep-pyne suvodeep-pyne changed the title Add health check callbacks for tenant tags and routing readiness [broker] Add health check callbacks for tenant tags and routing readiness Jan 27, 2026
@codecov-commenter
Copy link

codecov-commenter commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 85.54217% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.16%. Comparing base (234d382) to head (55614c8).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...roker/broker/helix/TenantTagReadinessCallback.java 80.00% 3 Missing and 4 partials ⚠️
.../broker/broker/helix/RoutingReadinessCallback.java 89.58% 2 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17580      +/-   ##
============================================
- Coverage     63.21%   63.16%   -0.05%     
  Complexity     1476     1476              
============================================
  Files          3172     3174       +2     
  Lines        189806   189889      +83     
  Branches      29046    29065      +19     
============================================
- Hits         119987   119952      -35     
- Misses        60508    60611     +103     
- Partials       9311     9326      +15     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.14% <85.54%> (-0.01%) ⬇️
java-21 63.13% <85.54%> (-0.04%) ⬇️
temurin 63.16% <85.54%> (-0.05%) ⬇️
unittests 63.16% <85.54%> (-0.05%) ⬇️
unittests1 55.52% <ø> (-0.02%) ⬇️
unittests2 34.05% <85.54%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…lState

ExternalView represents the actual cluster state (what Helix participants
have reported), while IdealState represents the desired state. Participants
should use ExternalView to determine their responsibilities, following the
Helix layer separation principle.
@suvodeep-pyne suvodeep-pyne marked this pull request as draft January 29, 2026 01:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants