# Uncomment to run the notebook in Colab
# ! pip install -q "wax-ml[complete]@git+https://github.com/eserie/wax-ml.git"
# ! pip install -q --upgrade jax jaxlib==0.1.67+cuda111 -f https://storage.googleapis.com/jax-releases/jax_releases.html
# check available devices
import jax
print("jax backend {}".format(jax.lib.xla_bridge.get_backend().platform))
jax.devices()
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
jax backend cpu
[CpuDevice(id=0)]

🎛 The 3-steps workflow 🎛¶


Thanks to WAX-ML accessors, it is already very convenient to execute a JAX function on a dataframe in a single step, with a single line of code.

WAX-ML’s 1-step stream API works like this:

<data-container>.stream(...).apply(...)

But this is not optimal because, under the hood, there are mainly three costly steps:

  • (1) (synchronize | data tracing | encode): make the data “JAX ready”

  • (2) (compile | code tracing | execution): compile and optimize a function for XLA, execute it.

  • (3) (format): convert data back to pandas/xarray/numpy format.

With the wax.stream primitives, it is quite easy to explicitly split the 1-step workflow into a 3-step workflow.

This will allow the user to have full control over each step and iterate on each one.
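
Schematically, the three steps map onto the WAX-ML primitives used in the rest of this notebook (a preview only: every object below is defined in the sections that follow):

# (1) synchronize | data tracing | encode: make the data “JAX ready”
transform_dataset, jxs = dataframe.wax.stream().prepare(dataset, my_ewma_on_dataset)
# (2) compile | code tracing | execution: unroll the compiled pure function on the steps
outputs, state = dynamic_unroll(transform_dataset, params, state, rng, False, jxs)
# (3) format: convert the raw outputs back to pandas/xarray
df = format_dataframe(dataset.coords, onp.array(outputs), format_dims=dataset.dataarray.dims)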

It is actually very useful to iterate on step (2), the “calculation step”, when you are doing research. You can then take full advantage of the JAX primitives, especially jit.

Let’s illustrate how to reimplement WAX-ML EWMA yourself with the WAX-ML 3-step workflow.
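
Before diving in, here is a minimal, standalone sketch of the adjusted EWMA recursion in plain JAX. This is not WAX-ML’s EWMA module, just a reference implementation of the same convention as pandas’ ewm(adjust=True).mean():

import jax.numpy as jnp
from jax import lax


def ewma_adjusted_sketch(x, alpha):
    # Running numerator/denominator formulation of the adjusted EWMA:
    #   num_t = x_t + (1 - alpha) * num_{t-1}
    #   den_t = 1   + (1 - alpha) * den_{t-1}
    #   y_t   = num_t / den_t
    def step(carry, x_t):
        num, den = carry
        num = x_t + (1.0 - alpha) * num
        den = 1.0 + (1.0 - alpha) * den
        return (num, den), num / den

    init = (jnp.zeros_like(x[0]), jnp.zeros_like(x[0]))
    _, y = lax.scan(step, init, x)
    return y


# For instance: ewma_adjusted_sketch(jnp.arange(5.0), alpha=1.0 / 10.0)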

Imports¶

import haiku as hk
import numpy as onp
import pandas as pd
import xarray as xr

from wax.accessors import register_wax_accessors
from wax.compile import jit_init_apply
from wax.external.eagerpy import convert_to_tensors
from wax.format import format_dataframe
from wax.modules import EWMA
from wax.stream import tree_access_data
from wax.unroll import dynamic_unroll

register_wax_accessors()

Performance on big dataframes¶

Generate data¶

T = 1.0e5
N = 1000
T, N = map(int, (T, N))
dataframe = pd.DataFrame(
    onp.random.normal(size=(T, N)), index=pd.date_range("1970", periods=T, freq="s")
)

Pandas EWMA¶

%%time
df_ewma_pandas = dataframe.ewm(alpha=1.0 / 10.0).mean()
CPU times: user 4.21 s, sys: 557 ms, total: 4.77 s
Wall time: 4.79 s

WAX-ML EWMA¶

%%time
df_ewma_wax = dataframe.wax.ewm(alpha=1.0 / 10.0).mean()
CPU times: user 2.12 s, sys: 416 ms, total: 2.53 s
Wall time: 2.49 s

It’s a little faster, but not that much faster…

WAX-ML EWMA (without format step)¶

Let’s disable the final formatting step (the output is now in raw JAX format):

%%time
df_ewma_wax_no_format = dataframe.wax.ewm(alpha=1.0 / 10.0, format_outputs=False).mean()
CPU times: user 464 ms, sys: 287 ms, total: 750 ms
Wall time: 751 ms
type(df_ewma_wax_no_format)
jaxlib.xla_extension.DeviceArray

Let’s check the device on which the calculation was performed (if you have a GPU available, this should be a GpuDevice; otherwise it will be a CpuDevice):

df_ewma_wax_no_format.device()
CpuDevice(id=0)

That’s better! In fact (see below), there is a performance problem in the final formatting step; see WEP3 for a proposal to improve it.
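
If you need a DataFrame back at this point, you can convert the raw output yourself. A minimal sketch, reusing the original index and columns (this conversion cost is precisely what WEP3 aims to reduce):

df_ewma_wax_manual = pd.DataFrame(
    onp.asarray(df_ewma_wax_no_format),
    index=dataframe.index,
    columns=dataframe.columns,
)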

Generate data (in dataset format)¶

The WAX-ML Stream object works on datasets, so let’s transform the DataFrame into an xarray Dataset:

dataset = xr.DataArray(dataframe).to_dataset(name="dataarray")

Step (1) (synchronize | data tracing | encode)¶

In this step, WAX-ML does the following:

  • “data tracing”: prepare the indices for fast access in the JAX function access_data

  • synchronize streams if there are multiple ones. This functionality has options: freq, ffills

  • encode and convert data from numpy to JAX: use encoders for datetime64 and string_ dtypes. Be aware that by default JAX works in float32 (see JAX’s Common Gotchas to work in float64; a sketch for enabling it follows this list).
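
A minimal sketch for enabling float64 (it must run before JAX creates any array, e.g. at the top of the notebook, otherwise 64-bit inputs are silently truncated to 32 bits):

from jax.config import config

config.update("jax_enable_x64", True)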

The function Stream.prepare implements this step (1). It prepares a function that wraps the input function together with the actual data and indices into a pair of pure functions (a Haiku TransformedWithState tuple).

%%time
stream = dataframe.wax.stream()
CPU times: user 18 µs, sys: 2 µs, total: 20 µs
Wall time: 24.1 µs

Define our custom function to be applied on a dict of arrays having the same structure as the original dataset:

def my_ewma_on_dataset(dataset):
    return EWMA(alpha=1.0 / 10.0, adjust=True)(dataset["dataarray"])
transform_dataset, jxs = stream.prepare(dataset, my_ewma_on_dataset)

Let’s define the init parameters and state of the transformation we will apply.

Init params and state¶

from wax.unroll import init_params_state
rng = jax.random.PRNGKey(42)
params, state = init_params_state(transform_dataset, rng, jxs)
params
FlatMapping({'ewma': FlatMapping({'alpha': DeviceArray(0.1, dtype=float32)})})
assert state["ewma"]["count"].shape == (N,)
assert state["ewma"]["mean"].shape == (N,)

Step (2) (compile | code tracing | execution)¶

In this step we:

  • prepare a pure function (with Haiku’s transform mechanism): define a “transformation” function which:

    • accesses the data

    • applies another transformation, here: EWMA

  • compile it with jax.jit

  • perform code tracing and execution (the last line):

    • Unroll the transformation on “steps” xs (a np.arange vector).
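
For intuition, here is a simplified sketch of what the unroll does with such a pair of pure functions. It is an illustration only, not WAX-ML’s actual dynamic_unroll implementation (which also handles the extra boolean argument passed below):

from jax import lax


def unroll_sketch(fun, params, state, rng, xs):
    # Scan the pure `apply` function over the step indices, threading the state.
    def body(state, step):
        out, state = fun.apply(params, state, rng, step)
        return state, out

    state, outputs = lax.scan(body, state, xs)
    return outputs, state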

rng = next(hk.PRNGSequence(42))
outputs, state = dynamic_unroll(transform_dataset, params, state, rng, False, jxs)
outputs.device()
CpuDevice(id=0)

Once it has been compiled and “traced” by JAX, the function is much faster to execute:

%%timeit
outputs, _ = dynamic_unroll(transform_dataset, params, state, rng, False, jxs)
_ = outputs.block_until_ready()
1.58 s ± 18.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is 3x faster than the pandas implementation!
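
As a quick sanity check (not part of the original workflow), we can compare the unrolled outputs with the pandas result; any difference should be tiny, dominated by JAX’s default float32 precision:

max_abs_diff = onp.abs(onp.asarray(outputs) - df_ewma_pandas.values).max()
print(f"max abs difference vs pandas: {max_abs_diff:.2e}")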

Manually prepare the data and manage the device¶

In order to manage the device on which the computations take place, we need even more control over the execution flow. Instead of calling stream.prepare to build the transform_dataset function, we can do it ourselves by:

  • using the stream.trace_dataset function

  • converting the numpy data to JAX ourselves

  • putting the data on the device we want.

np_data, np_index, xs = stream.trace_dataset(dataset)
jnp_data, jnp_index, jxs = convert_to_tensors((np_data, np_index, xs), "jax")
/Users/eserie/dev/git/wax-ml/venv/lib/python3.7/site-packages/jax/_src/numpy/lax_numpy.py:2983: UserWarning: Explicitly requested dtype float64 requested in asarray is not available, and will be truncated to dtype float32. To enable more dtypes, set the jax_enable_x64 configuration option or the JAX_ENABLE_X64 shell environment variable. See https://github.com/google/jax#current-gotchas for more.
  lax._check_user_dtype_supported(dtype, "asarray")
/Users/eserie/dev/git/wax-ml/venv/lib/python3.7/site-packages/jax/_src/numpy/lax_numpy.py:2983: UserWarning: Explicitly requested dtype int64 requested in asarray is not available, and will be truncated to dtype int32. To enable more dtypes, set the jax_enable_x64 configuration option or the JAX_ENABLE_X64 shell environment variable. See https://github.com/google/jax#current-gotchas for more.
  lax._check_user_dtype_supported(dtype, "asarray")

We explicitly put the data on the CPU (this is not needed if you only have CPUs):

from jax.tree_util import tree_leaves, tree_map

cpus = jax.devices("cpu")
jnp_data, jnp_index, jxs = tree_map(
    lambda x: jax.device_put(x, cpus[0]), (jnp_data, jnp_index, jxs)
)
print("data copied to CPU device.")
data copied to CPU device.

We now have “JAX-ready” data for later fast access.

Let’s define the transformation that wraps the actual data and indices in a pair of pure functions:

@jit_init_apply
@hk.transform_with_state
def transform_dataset(step):
    dataset = tree_access_data(jnp_data, jnp_index, step)
    return EWMA(alpha=1.0 / 10.0, adjust=True)(dataset["dataarray"])

And we can call it as before:

%%time
outputs, state = dynamic_unroll(transform_dataset, None, None, rng, False, jxs)
_ = outputs.block_until_ready()
CPU times: user 1.87 s, sys: 344 ms, total: 2.22 s
Wall time: 2.13 s
outputs.device()
CpuDevice(id=0)

Step (3) (format)¶

Let’s come back to pandas/xarray:

%%time
y = format_dataframe(
    dataset.coords, onp.array(outputs), format_dims=dataset.dataarray.dims
)
CPU times: user 69.5 ms, sys: 1.59 ms, total: 71.1 ms
Wall time: 69.4 ms

It’s quite slow (see WEP3 enhancement proposal).

GPU execution¶

Let’s look at execution on a GPU:

try:
    gpus = jax.devices("gpu")
    jnp_data, jnp_index, jxs = tree_map(
        lambda x: jax.device_put(x, gpus[0]), (jnp_data, jnp_index, jxs)
    )
    print("data copied to GPU device.")
    GPU_AVAILABLE = True
except RuntimeError as err:
    print(err)
    GPU_AVAILABLE = False
Unknown backend gpu. Available: ['interpreter', 'cpu']

Let’s check on which device our data now resides (it should be the GPU if one was available, otherwise it stays on the CPU):

tree_leaves(jnp_data)[0].device()
CpuDevice(id=0)
tree_leaves(jnp_index)[0].device()
CpuDevice(id=0)
jxs.device()
CpuDevice(id=0)
%%time
if GPU_AVAILABLE:
    rng = next(hk.PRNGSequence(42))
    outputs, state = dynamic_unroll(transform_dataset, None, None, rng, False, jxs)
CPU times: user 3 µs, sys: 2 µs, total: 5 µs
Wall time: 8.82 µs

Let’s redefine our function transform_dataset, explicitly passing the device option to jax.jit.

%%time
if GPU_AVAILABLE:

    @hk.transform_with_state
    def transform_dataset(step):
        dataset = tree_access_data(jnp_data, jnp_index, step)
        return EWMA(alpha=1.0 / 10.0, adjust=True)(dataset["dataarray"])

    transform_dataset = type(transform_dataset)(
        transform_dataset.init, jax.jit(transform_dataset.apply, device=gpus[0])
    )

    rng = next(hk.PRNGSequence(42))
    outputs, state = dynamic_unroll(transform_dataset, None, None, rng, False, jxs)
CPU times: user 3 µs, sys: 2 µs, total: 5 µs
Wall time: 9.06 µs
outputs.device()
CpuDevice(id=0)
%%timeit
if GPU_AVAILABLE:
    outputs, state = dynamic_unroll(transform_dataset, None, None, rng, False, jxs)
    _ = outputs.block_until_ready()
15.9 ns ± 0.0633 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)