Component specification#

Each Fondant component is defined by a component specification which describes its interface. The component specification serves three purposes:

  • To define which input data Fondant should provide to the component, and which output data it should write to storage.
  • To validate compatibility with other components.
  • To execute the component with the correct parameters.

The component specification should be defined by the author of the component.

Contents#

A component spec(ification) consists of the following sections:

name:
  ...
description:
  ...
image:
  ...

consumes:
  ...

produces: 
  ...

args: 
  ...

Metadata#

The metadata section holds general information about the component: its name, a description, and the URL of the Docker image used to run it.

name: Example component
description: This is an example component
image: example_component:latest

Consumes & produces#

The consumes and produces sections describe which data the component consumes and produces. The specification below for instance defines a component that creates an embedding from an image-caption combination.

...
consumes:
  images:
    fields:
      data:
        type: binary
  captions:
    fields:
      text:
        type: utf8

produces:
  embeddings:
    fields:
      data:
        type: array
        items:
          type: float32

The consumes and produces sections follow the schema below:

consumes/produces:
  <subset>:
    fields:
      <field>:
        type: <type>
    additionalFields: true
  additionalSubsets: true
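
As a rough illustration of the schema above, the sketch below checks the structure of a consumes or produces section that has already been parsed from YAML into a Python dict. The function name `validate_section` and its error messages are hypothetical and not part of Fondant; the checks only mirror what the schema on this page shows.

```python
def validate_section(section):
    """Return a list of structural problems found in a consumes/produces section.

    Hypothetical helper for illustration; an empty list means the section
    follows the schema shown above.
    """
    errors = []
    for key, value in section.items():
        # additionalSubsets is a flag on the section itself, not a subset.
        if key == "additionalSubsets":
            if not isinstance(value, bool):
                errors.append("additionalSubsets must be a boolean")
            continue

        # Every other key is a subset, which needs a mapping of fields.
        if not isinstance(value, dict) or not isinstance(value.get("fields"), dict):
            errors.append(f"subset '{key}' needs a 'fields' mapping")
            continue

        # additionalFields is optional and defaults to true.
        if not isinstance(value.get("additionalFields", True), bool):
            errors.append(f"additionalFields of subset '{key}' must be a boolean")

        # Each field must declare a type.
        for field_name, field in value["fields"].items():
            if not isinstance(field, dict) or "type" not in field:
                errors.append(f"field '{key}.{field_name}' needs a 'type'")
    return errors


consumes = {
    "images": {"fields": {"data": {"type": "binary"}}},
    "captions": {"fields": {"text": {"type": "utf8"}}},
}
print(validate_section(consumes))  # []
```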

Subsets#

A component consumes or produces subsets which match the subsets from the manifest.

  • Only those subsets defined in the consumes section of the component specification are read and passed to the component implementation.
  • Only those subsets defined in the produces section of the component specification are written to storage.

Fields#

Each subset defines a list of fields, which again match those from the manifest.

  • Only those fields defined in the consumes section of the component specification are read and passed to the component implementation.
  • Only those fields defined in the produces section of the component specification are written to storage.

Each field defines its expected data type. The type must be one of the types defined by Fondant, which largely correspond to the Arrow data types.
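
To make the matching between a component specification and a manifest concrete, here is a minimal sketch of a compatibility check. It assumes a simplified model in which the manifest is reduced to a mapping of subset name to field name to type definition (the `location` entries are left out). The function `find_incompatibilities` is hypothetical and is not how Fondant itself validates components.

```python
def find_incompatibilities(manifest_subsets, consumes):
    """List the consumed subsets and fields that the manifest cannot provide.

    Hypothetical helper: `manifest_subsets` maps subset -> field -> type
    definition, `consumes` is the consumes section of a component spec.
    """
    problems = []
    for subset_name, subset in consumes.items():
        if subset_name == "additionalSubsets":
            continue  # a flag, not a subset
        available = manifest_subsets.get(subset_name)
        if available is None:
            problems.append(f"missing subset '{subset_name}'")
            continue
        for field_name, expected in subset.get("fields", {}).items():
            actual = available.get(field_name)
            if actual is None:
                problems.append(f"missing field '{subset_name}.{field_name}'")
            elif actual != expected:
                problems.append(f"type mismatch for '{subset_name}.{field_name}'")
    return problems


manifest_subsets = {
    "images": {"data": {"type": "binary"}},
    "captions": {"data": {"type": "binary"}},
}
consumes = {
    "images": {"fields": {"data": {"type": "binary"}}},
    "captions": {"fields": {"text": {"type": "utf8"}}},
}
problems = find_incompatibilities(manifest_subsets, consumes)
print(problems)  # ["missing field 'captions.text'"]
```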

AdditionalSubsets & additionalFields#

The schema also defines the additionalSubsets and additionalFields keywords, which can be used to define which additional data should be passed on from the input to the output. They both default to true, which means that by default untouched data is passed on to the next component.

  • If additionalSubsets is false in the consumes section, all subsets not specified in the component specification's consumes section will be dropped.
  • If additionalSubsets is false in the produces section, all subsets not specified in the component specification's produces section will be dropped, including consumed subsets.
  • If additionalFields is false for a subset in the consumes section, all fields not specified will be dropped.
  • If additionalFields is false for a subset in the produces section, all fields not specified will be dropped, including consumed fields.
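
The four rules above can be sketched as a small function that derives the output subsets from the input subsets and a component specification. This is an illustration, not Fondant's implementation: the name `evolve_subsets` is hypothetical, the manifest is simplified to a mapping of subset name to field name to type definition (no `location`), and the order in which the rules are applied is an assumption inferred from the examples on this page.

```python
from copy import deepcopy


def split_section(section):
    """Separate the subset definitions from the additionalSubsets flag (default True)."""
    subsets = dict(section or {})
    additional_subsets = subsets.pop("additionalSubsets", True)
    return subsets, additional_subsets


def evolve_subsets(input_subsets, consumes, produces):
    """Apply the additionalSubsets / additionalFields rules to the input subsets."""
    consumed, keep_unconsumed = split_section(consumes)
    produced, keep_unproduced = split_section(produces)
    output = deepcopy(input_subsets)

    # consumes: drop subsets that are not listed when additionalSubsets is false.
    if not keep_unconsumed:
        output = {name: fields for name, fields in output.items() if name in consumed}

    # consumes: drop fields that are not listed when additionalFields is false.
    for name, subset in consumed.items():
        if name in output and not subset.get("additionalFields", True):
            output[name] = {f: t for f, t in output[name].items() if f in subset["fields"]}

    # produces: drop subsets that are not listed, including consumed ones.
    if not keep_unproduced:
        output = {name: fields for name, fields in output.items() if name in produced}

    # produces: add new fields and overwrite existing ones.
    for name, subset in produced.items():
        fields = output.get(name, {})
        if not subset.get("additionalFields", True):
            fields = {f: t for f, t in fields.items() if f in subset["fields"]}
        fields.update(deepcopy(subset["fields"]))
        output[name] = fields
    return output


input_subsets = {
    "images": {
        "width": {"type": "int32"},
        "height": {"type": "int32"},
        "data": {"type": "binary"},
    },
    "captions": {"data": {"type": "binary"}},
}
consumes = {"images": {"fields": {"data": {"type": "binary"}}}}
produces = {
    "embeddings": {"fields": {"data": {"type": "array", "items": {"type": "float32"}}}},
    "additionalSubsets": False,
}

# Only the produced embeddings subset survives, since additionalSubsets is false in produces.
result = evolve_subsets(input_subsets, consumes, produces)
print(sorted(result))  # ['embeddings']
```

With both flags left at their defaults, the same call returns the images and captions subsets untouched plus the new embeddings subset.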

Please check the examples below to build a better understanding.

Args#

The args section describes which arguments the component takes. Each argument is defined by a description and a type, which should be one of the built-in Python types. Additionally, you can set an optional default value for each argument.

args:
  custom_argument:
    description: A custom argument
    type: str
  default_argument:
    description: A default argument
    type: str
    default: bar

These arguments are passed in when the component is instantiated. If an argument is not explicitly provided, the default value will be used instead if available.

from fondant.pipeline import ComponentOp

custom_op = ComponentOp(
    component_dir="components/custom_component",
    arguments={
        "custom_argument": "foo"
    },
)
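
The fallback to default values can be pictured with the sketch below. It is a simplified illustration under stated assumptions, not Fondant's actual logic: `resolve_arguments` and `BUILTIN_TYPES` are hypothetical names, and the set of supported type names is assumed for the example.

```python
# Assumed mapping from type names in the spec to Python built-in types.
BUILTIN_TYPES = {"str": str, "int": int, "float": float, "bool": bool, "list": list, "dict": dict}


def resolve_arguments(args_spec, provided):
    """Combine explicitly provided arguments with the defaults from the args section."""
    unknown = set(provided) - set(args_spec)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")

    resolved = {}
    for name, spec in args_spec.items():
        if name in provided:
            value = provided[name]
        elif "default" in spec:
            value = spec["default"]  # fall back to the default from the spec
        else:
            raise ValueError(f"missing required argument '{name}'")

        expected_type = BUILTIN_TYPES[spec["type"]]
        if not isinstance(value, expected_type):
            raise TypeError(f"argument '{name}' should be of type {spec['type']}")
        resolved[name] = value
    return resolved


args_spec = {
    "custom_argument": {"description": "A custom argument", "type": "str"},
    "default_argument": {"description": "A default argument", "type": "str", "default": "bar"},
}
resolved = resolve_arguments(args_spec, {"custom_argument": "foo"})
print(resolved)  # {'custom_argument': 'foo', 'default_argument': 'bar'}
```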

Fondant then passes all arguments as keyword arguments to the __init__() method of the component.

import pandas as pd
from fondant.component import PandasTransformComponent
from fondant.executor import PandasTransformExecutor


class ExampleComponent(PandasTransformComponent):

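  def __init__(self, *args, custom_argument, default_argument) -> None:
    """
    Args:
        custom_argument: An argument passed to the component
        default_argument: An argument that falls back to its default value
            when it is not provided
    """
    # Initialize your component here based on the arguments
    self.custom_argument = custom_argument
    self.default_argument = default_argument

  def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
    """Implement your custom logic in this single method

    Args:
        dataframe: A Pandas dataframe containing the data

    Returns:
        A pandas dataframe containing the transformed data
    """
    # Placeholder: replace this with your own transformation
    return dataframe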

Examples#

Each component specification defines how the input manifest will be transformed into the output manifest. The following examples show how the component specification works:

Example 1: defaults#

Even though only a single subset and field are defined in both consumes and produces, all data is passed along since additionalSubsets and additionalFields default to true.

Input manifest, followed by the component spec and the resulting output manifest:
{
  "subsets": {
    "images": {
      "location": "...",
      "fields": {
        "width": {
          "type": "int32"
        },
        "height": {
          "type": "int32"
        },
        "data": {
          "type": "binary"
        }
      }
    },
    "captions": {
      "location": "...",
      "fields": {
        "data": {
          "type": "binary"
        }
      }
    }
  }
}

Component spec:

consumes:
  images:
    fields:
      data:
        type: binary

produces:
  embeddings:
    fields:
      data:
        type: array
        items:
          type: float32

Output manifest:

{
  "subsets": {
    "images": {
      "location": "...",
      "fields": {
        "width": {
          "type": "int32"
        },
        "height": {
          "type": "int32"
        },
        "data": {
          "type": "binary"
        }
      }
    },
    "captions": {
      "location": "...",
      "fields": {
        "data": {
          "type": "binary"
        }
      }
    },
    "embeddings": {
      "location": "...",
      "fields": {
        "data": {
          "type": "array",
          "items": {
            "type": "float32"
          }
        }
      }
    }
  }
}

Example 2: additionalSubsets: false in consumes#

When changing additionalSubsets in consumes to false, the unused captions subset is dropped.

Input manifest, followed by the component spec and the resulting output manifest:
{
  "subsets": {
    "images": {
      "location": "...",
      "fields": {
        "width": {
          "type": "int32"
        },
        "height": {
          "type": "int32"
        },
        "data": {
          "type": "binary"
        }
      }
    },
    "captions": {
      "location": "...",
      "fields": {
        "data": {
          "type": "binary"
        }
      }
    }
  }
}

Component spec:

consumes:
  images:
    fields:
      data:
        type: binary
  additionalSubsets: false

produces:
  embeddings:
    fields:
      data:
        type: array
        items:
          type: float32

Output manifest:

{
  "subsets": {
    "images": {
      "location": "...",
      "fields": {
        "width": {
          "type": "int32"
        },
        "height": {
          "type": "int32"
        },
        "data": {
          "type": "binary"
        }
      }
    },
    "embeddings": {
      "location": "...",
      "fields": {
        "data": {
          "type": "array",
          "items": {
            "type": "float32"
          }
        }
      }
    }
  }
}

Example 3: additionalFields: false in consumes#

When changing additionalFields in the consumed images subset to false, the unused fields of the images subset are dropped as well.

Input manifest, followed by the component spec and the resulting output manifest:
{
  "subsets": {
    "images": {
      "location": "...",
      "fields": {
        "width": {
          "type": "int32"
        },
        "height": {
          "type": "int32"
        },
        "data": {
          "type": "binary"
        }
      }
    },
    "captions": {
      "location": "...",
      "fields": {
        "data": {
          "type": "binary"
        }
      }
    }
  }
}

Component spec:

consumes:
  images:
    fields:
      data:
        type: binary
    additionalFields: false
  additionalSubsets: false

produces:
  embeddings:
    fields:
      data:
        type: array
        items:
          type: float32

Output manifest:

{
  "subsets": {
    "images": {
      "location": "...",
      "fields": {
        "data": {
          "type": "binary"
        }
      }
    },
    "embeddings": {
      "location": "...",
      "fields": {
        "data": {
          "type": "array",
          "items": {
            "type": "float32"
          }
        }
      }
    }
  }
}

Example 4: additionalSubsets: false in produces#

When changing additionalSubsets in produces to false, both the unused captions subset and the consumed images subset are dropped.

Input manifest, followed by the component spec and the resulting output manifest:
{
  "subsets": {
    "images": {
      "location": "...",
      "fields": {
        "width": {
          "type": "int32"
        },
        "height": {
          "type": "int32"
        },
        "data": {
          "type": "binary"
        }
      }
    },
    "captions": {
      "location": "...",
      "fields": {
        "data": {
          "type": "binary"
        }
      }
    }
  }
}

Component spec:

consumes:
  images:
    fields:
      data:
        type: binary

produces:
  embeddings:
    fields:
      data:
        type: array
        items:
          type: float32
  additionalSubsets: false

Output manifest:

{
  "subsets": {
    "embeddings": {
      "location": "...",
      "fields": {
        "data": {
          "type": "array",
          "items": {
            "type": "float32"
          }
        }
      }
    }
  }
}

Example 5: overwriting subsets#

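Finally, when a subset is defined in both consumes and produces, the produced fields overwrite the consumed ones. The remaining fields are passed on according to the additionalFields flag: here width and height are kept because the flag defaults to true, while the captions subset is dropped because additionalSubsets is false in produces.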

Input manifest, followed by the component spec and the resulting output manifest:
{
  "subsets": {
    "images": {
      "location": "...",
      "fields": {
        "width": {
          "type": "int32"
        },
        "height": {
          "type": "int32"
        },
        "data": {
          "type": "binary"
        }
      }
    },
    "captions": {
      "location": "...",
      "fields": {
        "data": {
          "type": "binary"
        }
      }
    }
  }
}

Component spec:

consumes:
  images:
    fields:
      data:
        type: binary

produces:
  images:
    fields:
      data:
        type: string
  additionalSubsets: false

Output manifest:

{
  "subsets": {
    "images": {
      "location": "...",
      "fields": {
        "width": {
          "type": "int32"
        },
        "height": {
          "type": "int32"
        },
        "data": {
          "type": "string"
        }
      }
    }
  }
}