
Conversation

@aymuos15
Contributor

Summary

Addresses #6753

This PR adds an in_memory option to PersistentDataset that combines the benefits of persistent disk storage with fast RAM access:

  • First access: Data is loaded from disk cache (or preprocessed if not cached) and stored in RAM
  • Subsequent accesses: Data is served directly from RAM

This provides the best of both worlds:

  • vs default PersistentDataset: No repeated disk reads during training epochs
  • vs CacheDataset: No reprocessing on script restart

Changes

  • Add in_memory: bool = False parameter to PersistentDataset
  • Add memory_cache_size property for inspecting cache state
  • Optimize _cachecheck() to compute cache key once
  • Clear memory cache when set_data() is called
  • Add comprehensive unit tests, including an automatic hybrid caching test

Usage

from monai.data import PersistentDataset

# Hybrid mode: disk persistence + RAM caching
ds = PersistentDataset(
    data=data,
    transform=transform,
    cache_dir="./cache",
    in_memory=True  # NEW
)
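
The memory_cache_size property listed under Changes can be used to confirm cache occupancy. A brief illustrative follow-up (assuming the property simply reports the number of items currently held in RAM):

# After one full pass, every item should be resident in RAM.
for i in range(len(ds)):
    _ = ds[i]
print(ds.memory_cache_size)  # expected: len(data)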

Test plan

  • Unit tests for in_memory cache functionality (a sketch follows this list)
  • Unit tests for cache_dir=None (pure RAM mode)
  • Unit tests for automatic hybrid caching behavior
  • Verified set_data() clears memory cache
  • Benchmarked against default PersistentDataset and CacheDataset
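
As a rough illustration of the first item above (a sketch only: the test name and fixture are hypothetical, while in_memory and memory_cache_size are the additions from this PR):

import tempfile
import unittest

import numpy as np

from monai.data import PersistentDataset
from monai.transforms import Lambda


class TestInMemoryCache(unittest.TestCase):
    def test_in_memory_populates_cache(self):
        data = [np.array([float(i)]) for i in range(4)]
        with tempfile.TemporaryDirectory() as cache_dir:
            ds = PersistentDataset(
                data=data, transform=Lambda(lambda x: x + 1), cache_dir=cache_dir, in_memory=True
            )
            self.assertEqual(ds.memory_cache_size, 0)  # nothing cached before first access
            for i in range(len(ds)):
                _ = ds[i]
            self.assertEqual(ds.memory_cache_size, len(data))  # every item now resident in RAM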

@coderabbitai
Contributor

coderabbitai bot commented Jan 12, 2026

📝 Walkthrough

Walkthrough

The PR adds an in-memory caching layer to PersistentDataset via a new in_memory: bool = False constructor parameter and a _memory_cache dict. A memory_cache_size property reports occupancy. Cache key construction is unified and the in-memory cache is checked before disk; disk hits populate memory and disk writes also populate memory. set_data clears the in-memory cache. Error handling for corrupt disk cache remains. Tests add three cases covering hybrid (disk+RAM), RAM-only, and restart/rebuild behavior.
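
In outline, the lookup order described above is roughly the following (a standalone sketch, not the actual MONAI source; the function name, the .pt suffix, and the explicit arguments are assumptions for illustration):

from pathlib import Path
from typing import Any, Callable, Dict, Optional

import torch


def lookup(item: Any, key: str, memory: Dict[str, Any], cache_dir: Optional[Path],
           transform: Callable[[Any], Any], in_memory: bool) -> Any:
    """Check RAM first, then disk, then recompute; populate caches on the way out."""
    if in_memory and key in memory:
        return memory[key]                      # RAM hit
    hashfile = cache_dir / f"{key}.pt" if cache_dir is not None else None
    if hashfile is not None and hashfile.is_file():
        result = torch.load(hashfile)           # disk hit
        if in_memory:
            memory[key] = result                # disk hit populates RAM
        return result
    result = transform(item)                    # miss: run the transform chain
    if hashfile is not None:
        torch.save(result, hashfile)            # disk write...
    if in_memory:
        memory[key] = result                    # ...also populates RAM
    return result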

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check ✅ Passed: The title accurately describes the main feature addition: an in_memory option for hybrid caching in PersistentDataset.
  • Description check ✅ Passed: The description covers the motivation (Issue #6753), implementation details, usage example, and test plan. However, the template's checklist items are incomplete.
  • Docstring Coverage ✅ Passed: Docstring coverage is 90.00%, which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @tests/data/test_persistentdataset.py:
- Around lines 294-296: The test expects untransformed data, but the transform used, _InplaceXform, mutates data by adding np.pi to data[0]. Update the assertion in the results loop to account for that: compute the expected row as list(range(i)) with np.pi added to its first element (or build expected with numpy and add np.pi to expected[0, 0]), and compare against results with a numeric comparison (e.g., np.testing.assert_allclose) instead of plain equality, so cached, transformed values like [[np.pi, 1]] match the expectation (sketched below).
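
A sketch of the corrected assertion (variable names and the (1, i) row shape are assumptions inferred from the [[np.pi, 1]] example in the prompt):

import numpy as np

for i, row in enumerate(results):
    expected = np.array([list(range(i))], dtype=float)  # row i before the transform
    if i > 0:
        expected[0, 0] += np.pi                         # _InplaceXform adds np.pi to data[0]
    np.testing.assert_allclose(row, expected)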
🧹 Nitpick comments (3)
tests/data/test_persistentdataset.py (1)

231-234: Consider testing object identity for memory cache hits.

With in_memory caching, accessing the same index twice should return the same object reference. Using assertIs(result1, result2) would verify the cache is truly being used.
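
For instance (a sketch within a unittest.TestCase; it assumes no transform runs after the cache lookup, which could otherwise return a copy):

result1 = ds[0]
result2 = ds[0]
self.assertIs(result1, result2)  # identical object implies the RAM cache was actually hit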

monai/data/dataset.py (2)

417-422: Static analysis hint: return in try block.

Ruff TRY300 suggests moving the return to an else block. The current code is correct, but restructuring would improve clarity by separating the success path from the exception handling.

Optional refactor
         if hashfile is not None and hashfile.is_file():  # cache hit
             try:
                 _item_transformed = torch.load(hashfile, weights_only=self.weights_only)
-                if self.in_memory:
-                    self._memory_cache[cache_key] = _item_transformed
-                return _item_transformed
             except PermissionError as e:
                 if sys.platform != "win32":
                     raise e
             except (UnpicklingError, RuntimeError) as e:  # corrupt or unloadable cached files are recomputed
                 if "Invalid magic number; corrupt file" in str(e) or isinstance(e, UnpicklingError):
                     warnings.warn(f"Corrupt cache file detected: {hashfile}. Deleting and recomputing.")
                     hashfile.unlink()
                 else:
                     raise e
+            else:
+                if self.in_memory:
+                    self._memory_cache[cache_key] = _item_transformed
+                return _item_transformed

444-461: Potential type inconsistency between disk and memory cache.

Line 445 converts to tensor before saving to disk with convert_to_tensor, but line 460 stores the original _item_transformed in memory. On reload, disk cache returns tensor types, but memory cache may have different types.

If consistency matters, store the converted version:

Proposed fix
+        _converted = convert_to_tensor(_item_transformed, convert_numeric=False, track_meta=self.track_meta)
         try:
             # NOTE: Writing to a temporary directory and then using a nearly atomic rename operation
             #       to make the cache more robust to manual killing of parent process
             #       which may leave partially written cache files in an incomplete state
             with tempfile.TemporaryDirectory() as tmpdirname:
                 temp_hash_file = Path(tmpdirname) / hashfile.name
                 torch.save(
-                    obj=convert_to_tensor(_item_transformed, convert_numeric=False, track_meta=self.track_meta),
+                    obj=_converted,
                     f=temp_hash_file,
                     pickle_module=look_up_option(self.pickle_module, SUPPORTED_PICKLE_MOD),
                     pickle_protocol=self.pickle_protocol,
                 )
                 if temp_hash_file.is_file() and not hashfile.is_file():
                     # On Unix, if target exists and is a file, it will be replaced silently if the user has permission.
                     # for more details: https://docs.python.org/3/library/shutil.html#shutil.move.
                     try:
                         shutil.move(str(temp_hash_file), hashfile)
                     except FileExistsError:
                         pass
         except PermissionError:  # project-monai/monai issue #3613
             pass
         if self.in_memory:
-            self._memory_cache[cache_key] = _item_transformed
+            self._memory_cache[cache_key] = _converted
         return _item_transformed
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 57fdd59 and e132236.

📒 Files selected for processing (2)
  • monai/data/dataset.py
  • tests/data/test_persistentdataset.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

Review the Python code for quality and correctness. Ensure variable names adhere to PEP8 style guides and are sensible and informative in regards to their function, though permitting simple names for loop and comprehension variables. Ensure routine names are meaningful in regards to their function and use verbs, adjectives, and nouns in a semantically appropriate way. Docstrings should be present for all definitions and should describe each variable, return value, and raised exception in the appropriate section of the Google style of docstrings. Examine code for logical errors or inconsistencies, and suggest what may be changed to address these. Suggest any enhancements that improve code efficiency, maintainability, comprehensibility, and correctness. Ensure new or modified definitions will be covered by existing or new unit tests.

Files:

  • tests/data/test_persistentdataset.py
  • monai/data/dataset.py
🧬 Code graph analysis (1)
tests/data/test_persistentdataset.py (2)
monai/data/dataset.py (5)
  • memory_cache_size (311-313)
  • set_data (335-344)
  • set_data (642-648)
  • set_data (875-910)
  • set_data (1114-1128)
tests/data/test_gdsdataset.py (1)
  • _InplaceXform (65-68)
🪛 Ruff (0.14.10)
monai/data/dataset.py

422-422: Consider moving this statement to an else block

(TRY300)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: packaging
  • GitHub Check: flake8-py3 (codeformat)
  • GitHub Check: flake8-py3 (mypy)
  • GitHub Check: quick-py3 (windows-latest)
  • GitHub Check: build-docs
  • GitHub Check: quick-py3 (ubuntu-latest)
  • GitHub Check: quick-py3 (macOS-latest)
  • GitHub Check: flake8-py3 (pytype)
  • GitHub Check: min-dep-pytorch (2.8.0)
  • GitHub Check: min-dep-py3 (3.12)
  • GitHub Check: min-dep-pytorch (2.5.1)
  • GitHub Check: min-dep-pytorch (2.7.1)
  • GitHub Check: min-dep-py3 (3.9)
  • GitHub Check: min-dep-py3 (3.10)
  • GitHub Check: min-dep-os (macOS-latest)
  • GitHub Check: min-dep-py3 (3.11)
  • GitHub Check: min-dep-os (windows-latest)
  • GitHub Check: min-dep-pytorch (2.6.0)
  • GitHub Check: min-dep-os (ubuntu-latest)
🔇 Additional comments (7)
tests/data/test_persistentdataset.py (2)

240-254: LGTM.

Pure RAM caching test correctly validates memory_cache_size behavior with cache_dir=None.


256-279: LGTM.

Good test coverage for the hybrid caching restart scenario. The use of Path.glob("*.pt") properly verifies disk cache population.

monai/data/dataset.py (5)

235-235: LGTM.

New in_memory parameter with sensible default False. Docstring is clear. Initialization is correct.

Also applies to: 277-280, 307-308


310-313: LGTM.

Simple property with appropriate docstring.


341-341: LGTM.

Memory cache correctly cleared on data update.


433-437: LGTM.

Pure RAM mode correctly stores transformed item when cache_dir=None.


413-422: Thread safety consideration for multi-worker DataLoader.

_memory_cache is a plain dict accessed without synchronization. With num_workers > 0 in DataLoader, each worker is a separate process with its own memory, so this works. However, if used with thread-based data loading or shared memory patterns, concurrent writes could race.

Current usage is safe for typical DataLoader patterns.
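
If thread-based loading ever becomes a concern, a minimal guard could look like the following (a hypothetical sketch, not part of this PR):

import threading
from typing import Any, Dict, Hashable


class ThreadSafeMemoryCache:
    """Lock-guarded dict wrapper for thread-based loaders (illustrative only)."""

    def __init__(self) -> None:
        self._data: Dict[Hashable, Any] = {}
        self._lock = threading.Lock()

    def get(self, key: Hashable, default: Any = None) -> Any:
        with self._lock:
            return self._data.get(key, default)

    def put(self, key: Hashable, value: Any) -> None:
        with self._lock:
            self._data[key] = value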

@ericspod
Member

Hi @aymuos15, this looks good to me in general, though I wonder about the warning from CodeRabbit regarding your tests. Can you check that the results it's complaining about are actually correct? I think it is right that pi should be added to the values. We are also waiting for a hotfix version to be released, so it might be a while before we can merge this. Thanks!

- Add `in_memory` parameter to PersistentDataset that combines persistent
  disk storage with RAM caching for faster subsequent access
- Add `memory_cache_size` property for inspecting cache state
- Compute cache key once in _cachecheck() to avoid redundant hash computation
- Clear memory cache when set_data() is called
- Works with or without cache_dir (pure RAM cache mode)

Addresses: Project-MONAI#6753

Signed-off-by: Soumya Snigdha Kundu <[email protected]>
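
For reference, the pure RAM cache mode mentioned in the last bullet of the commit message would be used roughly like this (an illustrative sketch; data and transform are placeholders):

from monai.data import PersistentDataset

# cache_dir=None: nothing is written to disk; items live only in this process's RAM.
ds = PersistentDataset(data=data, transform=transform, cache_dir=None, in_memory=True)
assert ds.memory_cache_size == 0  # populated lazily on first access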
@aymuos15 aymuos15 force-pushed the 6753-persistent-dataset-in-memory-cache branch from e132236 to 749a518 on January 18, 2026 at 02:07
@aymuos15
Contributor Author

Thank you very much for the update. I have addressed the CodeRabbit comment.

@ericspod
Member

> Thank you very much for the update. I have addressed the CodeRabbit comment.

I'm not sure I see anything different in the code. Do the commits you've made since my comment change things? Or was the CodeRabbit comment spurious?

- Store tensor-converted data in memory cache to match disk cache types
- Fix test_automatic_hybrid_caching assertions to account for
  _InplaceXform (from coderabbit)

Signed-off-by: Soumya Snigdha Kundu <[email protected]>
@aymuos15 aymuos15 force-pushed the 6753-persistent-dataset-in-memory-cache branch from bd21776 to 9b0166b on January 28, 2026 at 03:35

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@monai/data/dataset.py`:
- Around lines 434-437: The memory cache currently stores _item_transformed when hashfile is None but stores _item_converted later in hybrid mode, causing a type inconsistency. Change the branch that checks if hashfile is None to store _item_converted (not _item_transformed) in self._memory_cache when self.in_memory is true and cache_key is available, so both pure RAM and hybrid modes cache the converted/tensor form consistently. Ensure _item_converted is created before the assignment and use the same cache_key lookup used elsewhere.
🧹 Nitpick comments (1)
tests/data/test_persistentdataset.py (1)

256-311: LGTM!

Comprehensive test covering the key hybrid caching benefits: disk persistence across restarts and automatic RAM cache rebuild.

Minor style nit at line 299: Ruff suggests [np.pi, *list(range(1, i))] instead of concatenation, but this is optional.

Comment on lines 434 to 437
if hashfile is None:
    if self.in_memory:
        self._memory_cache[cache_key] = _item_transformed
    return _item_transformed


⚠️ Potential issue | 🟡 Minor

Memory cache stores different types in pure RAM vs hybrid mode.

Line 436 stores _item_transformed (unconverted), but line 462 stores _item_converted (tensor-converted). This causes type inconsistency between pure RAM mode (cache_dir=None) and hybrid mode.

For consistency, consider storing converted data in both paths:

Proposed fix
         if hashfile is None:
             if self.in_memory:
-                self._memory_cache[cache_key] = _item_transformed
+                _item_converted = convert_to_tensor(_item_transformed, convert_numeric=False, track_meta=self.track_meta)
+                self._memory_cache[cache_key] = _item_converted
             return _item_transformed

@aymuos15
Copy link
Contributor Author

aymuos15 commented Jan 28, 2026

I hadn't realised I didn't add the change back then. Apologies. I have now actually addressed the CodeRabbit comment.

