
Conversation

@alpe
Contributor

@alpe alpe commented Nov 27, 2025

Overview

Resolves #2609

The basic idea is to store an additional DAHeightHint field within the p2p store.

Since SignedHeader and Data are used in other places too, I added a DAHeightHintContainer type to wrap the tuple for the p2p store only.
The DA height hint is added by the da_submitter and read in the syncer to fetch the missing DA header/data for the most recent block as required.

Please note: this is a breaking change to the p2p network and store

@claude
Contributor

claude bot commented Nov 27, 2025

Claude finished @alpe's task in 2m 48s


Comprehensive Code Review

I've completed a thorough review of the DA hints implementation. Here are my findings organized by priority and category:


🔴 Critical Issues

1. Resource Leak: AsyncDARetriever Not Stopped

Location: block/internal/syncing/syncer.go:239-276

The AsyncDARetriever is started in Syncer.Start() (line 208) but never stopped in Syncer.Stop(). This causes goroutine leaks.

Impact: Each syncer restart leaks 5 worker goroutines indefinitely.

Fix Required:

// In Syncer.Stop(), before s.wg.Wait():
if s.asyncDARetriever != nil {
    s.asyncDARetriever.Stop()
}



2. Channel Not Closed on Async Retriever Stop

Location: block/internal/syncing/async_da_retriever.go:50-57

The workCh channel is never closed when the retriever stops, which could cause goroutines to hang if they're trying to send to it.

Current Code:

func (r *AsyncDARetriever) Stop() {
    if r.cancel != nil {
        r.cancel()
    }
    r.wg.Wait()
    // workCh is never closed!
}

Fix Required: Close workCh after all workers have exited to prevent potential hangs.



3. Unbounded Retry Loop in SetProcessedHeight

Location: block/internal/syncing/p2p_handler.go:62-72

The retry loop has a hardcoded limit of 1000 iterations with no backoff or error handling. This is a code smell that could hide CAS contention issues.

Current Code:

func (h *P2PHandler) SetProcessedHeight(height uint64) {
    for range 1_000 {  // Magic number, no error handling
        current := h.processedHeight.Load()
        if height <= current {
            return
        }
        if h.processedHeight.CompareAndSwap(current, height) {
            return
        }
    }
    // Silently fails after 1000 attempts!
}

Issues:

  • No logging if all 1000 attempts fail
  • No runtime.Gosched() to yield CPU during contention
  • Magic number without justification

Recommendation: Add logging if the loop exhausts, or consider if this pattern is necessary at all (CAS should succeed quickly in normal conditions).
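One possible shape for that recommendation: name the attempt limit, yield the CPU under contention, and return an explicit failure signal the caller can log. The maxCASAttempts constant and the boolean return are illustrative, not the project's actual API.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// maxCASAttempts is illustrative; if kept, the value should be justified
// in a comment rather than left as a magic number.
const maxCASAttempts = 1_000

type handler struct {
	processedHeight atomic.Uint64
}

// setProcessedHeight raises the stored height monotonically. It returns
// false on exhaustion so the caller can log, instead of failing silently.
func (h *handler) setProcessedHeight(height uint64) bool {
	for i := 0; i < maxCASAttempts; i++ {
		current := h.processedHeight.Load()
		if height <= current {
			return true // already at or beyond this height
		}
		if h.processedHeight.CompareAndSwap(current, height) {
			return true
		}
		runtime.Gosched() // yield instead of spinning hot under contention
	}
	return false // caller should log: CAS exhausted, update dropped
}

func main() {
	var h handler
	h.setProcessedHeight(7)
	h.setProcessedHeight(3) // lower heights are no-ops
	fmt.Println(h.processedHeight.Load())
}
```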


🟡 High Priority Issues

4. Missing Test Coverage for Binary Compatibility Edge Cases

Location: types/p2p_envelope_test.go

The backward compatibility tests (TestSignedHeaderBinaryCompatibility and TestDataBinaryCompatibility) are excellent, but they don't test:

  • What happens when an old node (using SignedHeader) receives data from a new node (using P2PSignedHeader with DA hints)?
  • What happens when da_height_hint is 0 vs unset (both will be 0 in Go)?
  • Forward compatibility: Can new code handle legacy data?

Coverage Gap: The tests verify that P2P types can unmarshal old formats, but don't verify all edge cases around the optional field semantics.

Recommendation: Add tests for:

// Test: Old format → New format → Old format round-trip
// Test: What happens with da_height_hint = 0 (valid DA height)
// Test: Partial compatibility scenarios

5. Potential Race in AsyncDARetriever.RequestRetrieval

Location: block/internal/syncing/async_da_retriever.go:61-76

The pattern of checking inFlight and adding to it is racy with the cleanup in processRetrieval:

// RequestRetrieval - adds to inFlight under lock
func (r *AsyncDARetriever) RequestRetrieval(height uint64) {
    r.mu.Lock()
    defer r.mu.Unlock()
    if _, exists := r.inFlight[height]; exists {
        return
    }
    // ... send to workCh
    r.inFlight[height] = struct{}{}
}

// processRetrieval - removes from inFlight under lock
defer func() {
    r.mu.Lock()
    delete(r.inFlight, height)
    r.mu.Unlock()
}()

Scenario: If a request completes and deletes from inFlight, then a duplicate request comes in before the channel is drained, it will be queued again.

Assessment: The current implementation handles this correctly because workCh acts as a buffer, but the intent could be clearer. Consider documenting this behavior or adding a test case.


6. Error Handling: Silent Failures in AppendDAHint

Location: block/internal/submitting/da_submitter.go:213-216

if err := s.headerDAHintAppender.AppendDAHint(ctx, res.Height, hashes...); err != nil {
    s.logger.Error().Err(err).Msg("failed to append da height hint in header p2p store")
    // ignoring error here, since we don't want to block the block submission'
}

Concern: While the comment explains the rationale, this means DA hints can be silently lost without any retry mechanism or visibility into how often this happens.

Recommendation:

  • Add a metric to track AppendDAHint failures
  • Consider a background retry mechanism for failed hint updates
  • Document the implications: nodes will fall back to scanning DA without hints

7. Protobuf Field Numbering Concern

Location: proto/evnode/v1/evnode.proto:110-123

The P2PSignedHeader and P2PData messages use optional fields for da_height_hint:

message P2PSignedHeader {
  Header header = 1;
  bytes signature = 2;
  Signer signer = 3;
  optional uint64 da_height_hint = 4;  // Field 4 was reserved in SignedHeader
}

Issue: The use of field number 4, which was reserved in SignedHeader, is intentional for backward compatibility. However, the PR description states this is a "breaking change to the p2p network."

Question: If fields 1-3 are identical and field 4 uses a reserved slot, shouldn't this be backward compatible on the wire format? The tests suggest it is. This contradiction needs clarification.

Recommendation: Update the PR description to accurately reflect the compatibility story, or explain what specifically is breaking.


🟢 Medium Priority Issues

8. Inconsistent Error Logging Levels

Location: block/internal/syncing/async_da_retriever.go:100

r.logger.Debug().Err(err).Uint64("height", height).Msg("async DA retrieval failed")

DA retrieval failures are logged at Debug level, which may make production troubleshooting difficult. Consider whether some failures should be Warn or Error level, or add a metric.


9. Magic Number: Worker Pool Size

Location: block/internal/syncing/async_da_retriever.go:42-43

for i := 0; i < 5; i++ {  // Hardcoded worker count
    r.wg.Add(1)
    go r.worker()
}

Issue: The number 5 is hardcoded without configuration option or justification.

Recommendation: Make this configurable or document why 5 workers is the right number.
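The configurability suggestion could look like the sketch below; retrieverConfig, DAWorkers, and defaultDAWorkers are hypothetical names, with the default documented at its current value.

```go
package main

import "fmt"

// defaultDAWorkers matches the currently hardcoded value; in real code
// the choice should be justified (e.g. expected DA RPC concurrency).
const defaultDAWorkers = 5

type retrieverConfig struct {
	DAWorkers int // 0 means "use the default"
}

// workerCount resolves the configured worker pool size, falling back to
// the documented default for zero or negative values.
func workerCount(cfg retrieverConfig) int {
	if cfg.DAWorkers <= 0 {
		return defaultDAWorkers
	}
	return cfg.DAWorkers
}

func main() {
	fmt.Println(workerCount(retrieverConfig{}))             // default
	fmt.Println(workerCount(retrieverConfig{DAWorkers: 8})) // override
}
```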


10. Buffer Size Selection

Location: block/internal/syncing/async_da_retriever.go:33

workCh: make(chan uint64, 100), // Buffer size 100

Question: What's the rationale for a buffer of 100? This should be documented or made configurable based on expected concurrency.


11. Missing Context Propagation

Location: pkg/sync/sync_service.go:184-198

The AppendDAHint method uses the context passed in, but there's no timeout or cancellation handling specific to this operation. If the context is cancelled mid-operation, the store could be left in an inconsistent state.

Current Code:

func (s *SyncService[V]) AppendDAHint(ctx context.Context, daHeight uint64, hashes ...types.Hash) error {
    entries := make([]V, 0, len(hashes))
    for _, h := range hashes {
        v, err := s.store.Get(ctx, h)  // Could be cancelled here
        // ...
    }
    return s.store.Append(ctx, entries...)  // Or here
}

Recommendation: Consider using a child context with timeout for critical sections, or document that this must be called with a non-cancellable context.


🔵 Code Quality & Best Practices

12. Excellent Binary Compatibility Strategy

Location: types/p2p_envelope.go

The implementation of binary compatibility between SignedHeader/Data and P2PSignedHeader/P2PData is well-designed:

✅ Uses proto field compatibility (fields 1-3 match exactly)
✅ Comprehensive tests for both directions
✅ Handles optional fields correctly

Recommendation: Document this pattern in comments as it's a clever solution to the upgrade problem.


13. Well-Structured Async Retriever

Location: block/internal/syncing/async_da_retriever.go

The worker pool pattern is well-implemented:

  • Proper use of sync.WaitGroup
  • Idempotent request handling via inFlight map
  • Non-blocking request submission with overflow protection

Minor improvement: Add metrics for queue depth and worker utilization.


14. Generic Interface Design

Location: pkg/sync/sync_service.go:39-43

type EntityWithDAHint[H any] interface {
    header.Header[H]
    SetDAHint(daHeight uint64)
    DAHint() uint64
}

Clean use of Go generics to extend the header interface with DA hint capabilities. Well done.
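For illustration, the two hint methods are trivial for a wrapper type to satisfy; hintedHeader is hypothetical, and the header.Header[H] methods are elided, so this sketch covers only the hint half of the interface.

```go
package main

import "fmt"

// hintedHeader is a hypothetical wrapper showing the SetDAHint/DAHint
// half of EntityWithDAHint; the embedded header methods are omitted.
type hintedHeader struct {
	daHeightHint uint64
}

func (h *hintedHeader) SetDAHint(daHeight uint64) { h.daHeightHint = daHeight }
func (h *hintedHeader) DAHint() uint64            { return h.daHeightHint }

func main() {
	var h hintedHeader
	h.SetDAHint(1234)
	fmt.Println(h.DAHint())
}
```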


🟣 Testing & Coverage

15. Test Coverage Analysis

Codecov Report: 70.17% patch coverage with 68 lines missing

Key gaps:

  • types/p2p_envelope.go: 53.42% (29 missing, 5 partial)
  • block/internal/syncing/syncer.go: 58.62% (12 missing)
  • pkg/sync/sync_service.go: 69.44% (7 missing, 4 partial)

Critical uncovered paths:

  1. Error handling in MarshalBinary/UnmarshalBinary for unsupported types
  2. AsyncDARetriever worker pool saturation
  3. DA hint append failures and retry logic

Recommendation: Add tests for error paths and edge cases, particularly around the async retriever behavior under high load.


16. Good Test Quality

Location: block/internal/syncing/async_da_retriever_test.go

The tests demonstrate:

  • ✅ Idempotency verification
  • ✅ Worker pool limits
  • ✅ Concurrent request handling

These are well-written and cover the important behavioral aspects.


🟤 Documentation & Maintenance

17. Missing Migration Documentation

Location: PR description and code comments

The PR mentions "breaking change to the p2p network and store" but:

  • No migration guide for existing networks
  • No version compatibility matrix
  • No rollback strategy

From comments: @tac0turtle asked "how would an existing network handle using this feature? its breaking so is it safe to upgrade?" and @alpe responded about binary compatibility, but this needs formal documentation.

Recommendation: Add a migration guide covering:

  • Upgrade sequence for live networks
  • Data migration for p2p stores
  • Backward compatibility windows
  • Rollback procedures

18. Inline Comments for Complex Logic

Location: block/internal/syncing/syncer.go:600-622

The DA hint extraction logic in processHeightEvent is complex but lacks inline comments:

if event.Source == common.SourceP2P {
    var daHeightHints []uint64
    if event.DaHeightHints[0] > 0 && event.DaHeightHints[1] > 0 {
        // Why this logic? Comment needed
        if _, exists := s.cache.GetHeaderDAIncluded(event.Header.Hash().String()); !exists {
            daHeightHints = []uint64{event.DaHeightHints[0]}
        }
        // ... more complex logic
    }
}

Recommendation: Add comments explaining the deduplication logic and why we check cache for DA inclusion.


📊 Summary

Blocking Issues (Must Fix Before Merge)

  1. ✅ Goroutine leak in AsyncDARetriever (not stopped)
  2. ✅ workCh channel not closed on Stop
  3. ⚠️ Clarify breaking change story vs. binary compatibility

Recommended Fixes

  • Add metrics for DA hint append failures
  • Document worker pool sizing decisions
  • Improve test coverage for error paths
  • Add migration documentation

Positive Highlights

  • Binary compatibility strategy is excellent
  • AsyncDARetriever design is solid
  • Generic interface design is clean
  • Comprehensive compatibility tests

Overall Assessment

The implementation is well-designed with a clever approach to backward compatibility. The main concerns are resource management (goroutine leak) and operational visibility (metrics, logging). Once the critical issues are addressed, this will be a solid addition.

Recommendation: Address the resource leak and clarify the breaking change documentation before merging. The CI issues mentioned by @tac0turtle should also be investigated.


@github-actions
Contributor

github-actions bot commented Nov 27, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Jan 20, 2026, 1:38 PM

@alpe alpe changed the title from "Alex/hints" to "feat: DA hints in p2p" on Nov 28, 2025
@codecov

codecov bot commented Nov 28, 2025

Codecov Report

❌ Patch coverage is 70.17544% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.71%. Comparing base (9ad4016) to head (6e4554d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| types/p2p_envelope.go | 53.42% | 29 Missing and 5 partials ⚠️ |
| block/internal/syncing/syncer.go | 58.62% | 12 Missing ⚠️ |
| pkg/sync/sync_service.go | 69.44% | 7 Missing and 4 partials ⚠️ |
| block/internal/syncing/async_da_retriever.go | 86.53% | 6 Missing and 1 partial ⚠️ |
| block/internal/submitting/da_submitter.go | 80.95% | 2 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2891      +/-   ##
==========================================
+ Coverage   59.53%   59.71%   +0.17%     
==========================================
  Files         107      109       +2     
  Lines       10075    10261     +186     
==========================================
+ Hits         5998     6127     +129     
- Misses       3447     3497      +50     
- Partials      630      637       +7     
| Flag | Coverage Δ |
|---|---|
| combined | 59.71% <70.17%> (+0.17%) ⬆️ |

Flags with carried forward coverage won't be shown.


alpe added 3 commits November 28, 2025 17:20
* main:
  refactor: omit unnecessary reassignment (#2892)
  build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  fix: ensure consistent network ID usage in P2P subscriber (#2884)
cache.SetHeaderDAIncluded(headerHash.String(), res.Height, header.Height())
hashes[i] = headerHash
}
if err := s.headerDAHintAppender.AppendDAHint(ctx, res.Height, hashes...); err != nil {
Contributor Author

This is where the DA height is passed to the sync service to update the p2p store

Msg("P2P event with DA height hint, triggering targeted DA retrieval")

// Trigger targeted DA retrieval in background via worker pool
s.asyncDARetriever.RequestRetrieval(daHeightHint)
Contributor Author

This is where the "fetch from DA" is triggered for the current block event height

type SignedHeaderWithDAHint = DAHeightHintContainer[*types.SignedHeader]
type DataWithDAHint = DAHeightHintContainer[*types.Data]

type DAHeightHintContainer[H header.Header[H]] struct {
@alpe alpe Dec 1, 2025

This is a data container to persist the DA hint together with the block header or data.
types.SignedHeader and types.Data are used all over the place, so I did not modify them but introduced this type for the p2p store and transfer only.

It may make sense to make this a proto type. WDYT?

return nil
}

func (s *SyncService[V]) AppendDAHint(ctx context.Context, daHeight uint64, hashes ...types.Hash) error {
Contributor Author

Stores the DA height hints

@alpe alpe marked this pull request as ready for review December 1, 2025 09:32
@tac0turtle
Contributor

if da hint is not in the proto how do other nodes get knowledge of the hint?

also how would an existing network handle using this feature? it's breaking, so is it safe to upgrade?

"github.com/evstack/ev-node/block/internal/cache"
"github.com/evstack/ev-node/block/internal/common"
"github.com/evstack/ev-node/block/internal/da"
coreda "github.com/evstack/ev-node/core/da"
Member

nit: gci linter

Member

@julienrbrt julienrbrt left a comment

Nice! It really makes sense.

I share the same concern as @tac0turtle however about the upgrade strategy given it is p2p breaking.

julienrbrt
julienrbrt previously approved these changes Dec 2, 2025
@alpe
Contributor Author

alpe commented Dec 2, 2025

if da hint is not in the proto how do other nodes get knowledge of the hint?

The sync_service wraps the header/data payload in a DAHeightHintContainer object that is passed upstream to the p2p layer. When the DA height is known, the store is updated.

also how would an existing network handle using this feature? it's breaking, so is it safe to upgrade?

It is a breaking change. Instead of signed header or data types, the p2p network exchanges DAHeightHintContainer. This would be incompatible. Also the existing p2p stores would need migration to work.

@julienrbrt
Member

julienrbrt commented Dec 4, 2025

Could we broadcast both until all networks are updated? Then, for the final release, we can basically discard the previous one.

@alpe
Contributor Author

alpe commented Dec 5, 2025

fyi: This PR is missing a migration strategy for the p2p store (and ideally the network)

* main:
  refactor(sequencers): persist prepended batch (#2907)
  feat(evm): add force inclusion command (#2888)
  feat: DA client, remove interface part 1: copy subset of types needed for the client using blob rpc. (#2905)
  feat: forced inclusion (#2797)
  fix: fix and cleanup metrics (sequencers + block) (#2904)
  build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  refactor(block): centralize timeout in client (#2903)
  build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  chore: bump default timeout (#2902)
  fix: revert default db (#2897)
  refactor: remove obsolete // +build tag (#2899)
  fix:da visualiser namespace  (#2895)
alpe added 3 commits December 15, 2025 10:52
* main:
  chore: execute goimports to format the code (#2924)
  refactor(block)!: remove GetLastState from components (#2923)
  feat(syncing): add grace period for missing force txs inclusion (#2915)
  chore: minor improvement for docs (#2918)
  feat: DA Client remove interface part 2,  add client for celestia blob api   (#2909)
  chore: update rust deps (#2917)
  feat(sequencers/based): add based batch time (#2911)
  build(deps): Bump golangci/golangci-lint-action from 9.1.0 to 9.2.0 (#2914)
  refactor(sequencers): implement batch position persistance (#2908)
github-merge-queue bot pushed a commit that referenced this pull request Dec 15, 2025

## Overview

Temporary fix until #2891.
After #2891 the verification for p2p blocks will be done in the
background.

ref: #2906

@alpe
Contributor Author

alpe commented Dec 15, 2025

I have added 2 new types for the p2p store that are binary compatible with types.Data and SignedHeader. With this, we should be able to roll this out without breaking in-flight p2p data or the store.

alpe added 3 commits December 15, 2025 14:49
* main:
  fix(syncing): skip forced txs checks for p2p blocks (#2922)
  build(deps): Bump the all-go group across 5 directories with 5 updates (#2919)
  chore: loosen syncer state check (#2927)
@alpe alpe requested a review from julienrbrt December 15, 2025 15:00
julienrbrt
julienrbrt previously approved these changes Dec 15, 2025
Member

@julienrbrt julienrbrt left a comment

lgtm! I can see how useful the async retriever will be for force inclusion verification as well. We should have @auricom verify if p2p still works with Eden.

Member

This is going to be really useful for force inclusion checks as well.

* main:
  build(deps): Bump actions/cache from 4 to 5 (#2934)
  build(deps): Bump actions/download-artifact from 6 to 7 (#2933)
  build(deps): Bump actions/upload-artifact from 5 to 6 (#2932)
  feat: DA Client remove interface part 3, replace types with new code (#2910)
  DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920)
julienrbrt
julienrbrt previously approved these changes Dec 16, 2025
alpe added 3 commits December 19, 2025 17:00
* main:
  feat: use DA timestamp (#2939)
  chore: improve code comments clarity (#2943)
  build(deps): bump libp2p (#2937)
(cherry picked from commit ad3e21b)
julienrbrt
julienrbrt previously approved these changes Dec 19, 2025
* main:
  fix: make evm_execution more robust (#2942)
  fix(sequencers/single): deterministic queue (#2938)
  fix(block): fix init logic sequencer for da epoch fetching (#2926)
github-merge-queue bot pushed a commit that referenced this pull request Jan 2, 2026
Introduce envelope for headers on DA to fail fast on unauthorized
content.
Similar approach as in #2891 with a binary compatible sibling type that
carries the additional information.
 
* Add DAHeaderEnvelope type to wrap signed headers on DA
  * Binary compatible with the `SignedHeader` proto type
  * Includes a signature of the plain content
* DARetriever checks for a valid signature early in the process
* Supports `SignedHeader` for legacy compatibility until the first signed envelope is read
alpe added 2 commits January 8, 2026 10:06
* main:
  chore: fix some minor issues in the comments (#2955)
  feat: make reaper poll duration configurable (#2951)
  chore!: move sequencers to pkg (#2931)
  feat: Ensure Header integrity on DA (#2948)
  feat(testda): add header support with GetHeaderByHeight method (#2946)
  chore: improve code comments clarity (#2947)
  chore(sequencers): optimize store check (#2945)
@tac0turtle
Contributor

CI seems to be having some issues, can these be fixed?

Also, was this tested on an existing network? If not, please do that before merging.

alpe added 6 commits January 19, 2026 09:46
* main:
  fix: inconsistent state detection and rollback (#2983)
  chore: improve graceful shutdown restarts (#2985)
  feat(submitting): add posting strategies (#2973)
  chore: adding syncing tracing (#2981)
  feat(tracing): adding block production tracing (#2980)
  feat(tracing): Add Store, P2P and Config tracing (#2972)
  chore: fix upgrade test (#2979)
  build(deps): Bump github.com/ethereum/go-ethereum from 1.16.7 to 1.16.8 in /execution/evm/test in the go_modules group across 1 directory (#2974)
  feat(tracing): adding tracing to DA client (#2968)
  chore: create onboarding skill  (#2971)
  test: add e2e tests for force inclusion (part 2) (#2970)
  feat(tracing): adding eth client tracing (#2960)
  test: add e2e tests for force inclusion (#2964)
  build(deps): Bump the all-go group across 4 directories with 10 updates (#2969)
  fix: Fail fast when executor ahead (#2966)
  feat(block): async epoch fetching (#2952)
  perf: tune badger defaults and add db bench (#2950)
  feat(tracing): add tracing to EngineClient (#2959)
  chore: inject W3C headers into engine client and eth client (#2958)
  feat: adding tracing for Executor and added initial configuration (#2957)
* main:
  feat(tracing): tracing part 9 sequencer (#2990)
  build(deps): use mainline go-header (#2988)
* main:
  chore: update calculator for strategies  (#2995)
  chore: adding tracing for da submitter (#2993)
  feat(tracing): part 10 da retriever tracing (#2991)
  chore: add da posting strategy to docs (#2992)
* main:
  build(deps): Bump the all-go group across 5 directories with 5 updates (#2999)
  feat(tracing): adding forced inclusion tracing (#2997)

Development

Successfully merging this pull request may close these issues.

sync: P2P should provide da inclusion hints

4 participants