Lazy-loading data modules in Python with one magic function

#python #data

Greg Wilson was soliciting strategies for lazy-loading datasets in Python modules. There are, of course, many ways to do this, but I didn't see this one being discussed.

Since Python 3.7 (released in 2018) you can define a module-level __getattr__ function, which Python calls when normal attribute lookup on the module fails (see PEP 562). You can use this to call a data-loading function the first time each module-level attribute is accessed. This implementation uses only functions, dicts, and the one magic “dunder” function, which might be more approachable than programming with classes (depending on your audience).

A sketch of an implementation:

# Cache loaded datasets
_dataset_cache = {}

# Dictionary mapping dataset names to loader functions
_dataset_loaders = {
    # maps names to zero-argument functions that load datasets
}


def __getattr__(name):
    """PEP 562 module-level __getattr__ function for lazy loading"""
    # Check if we should load a dataset for this attribute
    if name in _dataset_loaders:
        # If not already cached, load and cache it
        if name not in _dataset_cache:
            _dataset_cache[name] = _dataset_loaders[name]()
        return _dataset_cache[name]

    # If not a dataset, raise AttributeError
    raise AttributeError(f"Module has no attribute '{name}'")

For example, this module has attributes foo and bar which are only calculated when they're first referenced:

# foo.py

# Cache loaded datasets
_dataset_cache = {}


def load_data(name):
    if name == "foo":
        print("loading foo")
        return 1
    elif name == "bar":
        print("loading bar")
        return 2
    else:
        # This should be unreachable, but just in case...
        raise AttributeError(f"Module has no attribute '{name}'")


# Dictionary mapping dataset names to loader functions
_dataset_loaders = {
    "foo": lambda: load_data("foo"),
    "bar": lambda: load_data("bar"),
}


def __getattr__(name):
    """PEP 562 module-level __getattr__ function for lazy loading"""
    # Check if we should load a dataset for this attribute
    if name in _dataset_loaders:
        # If not already cached, load and cache it
        if name not in _dataset_cache:
            _dataset_cache[name] = _dataset_loaders[name]()
        return _dataset_cache[name]

    # If not a dataset, raise AttributeError
    raise AttributeError(f"Module has no attribute '{name}'")

In use, this looks like:

>>> import foo
>>> foo.foo
loading foo
1
>>> foo.foo
1
>>> # NOTE: didn't load foo a second time
>>> foo.bar
loading bar
2
>>> foo.quuz
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    foo.quuz
  File "/Users/nknight/tmp/foo.py", line 34, in __getattr__
    raise AttributeError(f"Module has no attribute '{name}'")
AttributeError: Module has no attribute 'quuz'
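One thing the transcript doesn't show: names served by __getattr__ are not in the module's namespace, so dir(foo) and tab completion won't list them. PEP 562 also allows a module-level __dir__ function, which can advertise them. The following is a sketch of that idea, not part of the module above. It rebuilds a cut-down stand-in for foo.py in memory (again via types.ModuleType, only to keep the example self-contained) with trivial loaders, and the __dir__ body is one possible choice among several.

```python
import types

# Cut-down stand-in for foo.py, with one addition: a module-level __dir__.
source = '''
_dataset_cache = {}
_dataset_loaders = {
    "foo": lambda: 1,
    "bar": lambda: 2,
}

def __getattr__(name):
    if name in _dataset_loaders:
        if name not in _dataset_cache:
            _dataset_cache[name] = _dataset_loaders[name]()
        return _dataset_cache[name]
    raise AttributeError(f"Module has no attribute '{name}'")

def __dir__():
    # Advertise the lazy names alongside the real module globals.
    return sorted(set(globals()) | set(_dataset_loaders))
'''

foo = types.ModuleType("foo")
exec(source, foo.__dict__)

print("bar" in dir(foo))   # True, although bar has not been loaded yet
print(foo._dataset_cache)  # {} (listing names does not trigger any loader)
print(foo.bar)             # 2
```

Listing the names stays cheap because __dir__ only reads the keys of _dataset_loaders; nothing is loaded until an attribute is actually accessed.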