Dataset and Dataloader

The first component of Emmental’s pipeline takes user-provided data and creates an Emmental Dataset and Dataloader from it.

Dataset and Dataloader Classes

The following docs describe elements of Emmental’s Dataset and Dataloader.

Emmental dataset and dataloader.

class emmental.data.EmmentalDataLoader(task_to_label_dict, dataset, split='train', collate_fn=None, n_batches=None, **kwargs)[source]

Bases: torch.utils.data.dataloader.DataLoader

Emmental dataloader.

An advanced dataloader class that stores a mapping from each task to its label (which label(s) in the dataset’s Y_dict to use for that task) and the split (which part of the data this dataset belongs to).

Parameters
  • task_to_label_dict (Dict[str, str]) – The task-to-label mapping, where each key is a task name and each value is the label(s) for that task; the value must be a key in Y_dict.

  • dataset (EmmentalDataset) – The dataset used to construct the dataloader.

  • split (str) – The split information, defaults to “train”.

  • collate_fn (Optional[Callable]) – The function that merges a list of samples to form a mini-batch, defaults to emmental_collate_fn.

  • n_batches (Optional[int]) – Total number of batches.

  • **kwargs – Other arguments passed to the dataloader.
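The relationship between task_to_label_dict and Y_dict can be sketched in plain Python. This is a minimal illustration, not Emmental's implementation: select_task_labels is a hypothetical helper, the task and label names are invented, and plain lists stand in for tensors.

```python
from typing import Any, Dict, List


def select_task_labels(
    task_to_label_dict: Dict[str, str],
    Y_dict: Dict[str, List[Any]],
) -> Dict[str, List[Any]]:
    """Return the labels each task would read from Y_dict (hypothetical helper)."""
    # Every label name referenced by a task must be a key in Y_dict.
    missing = [label for label in task_to_label_dict.values() if label not in Y_dict]
    if missing:
        raise KeyError(f"Labels not found in Y_dict: {missing}")
    return {task: Y_dict[label] for task, label in task_to_label_dict.items()}


# Two label sets stored in the dataset; plain lists stand in for tensors.
Y_dict = {"sentiment_label": [1, 0, 1], "topic_label": [2, 2, 0]}

# Each task name maps to the Y_dict key that holds its labels.
task_to_label_dict = {
    "sentiment_task": "sentiment_label",
    "topic_task": "topic_label",
}

task_labels = select_task_labels(task_to_label_dict, Y_dict)
```

Keeping the mapping separate from the dataset means the same Y_dict can serve several tasks, each reading only the label set it needs.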

class emmental.data.EmmentalDataset(name, X_dict, Y_dict=None, uid=None)[source]

Bases: torch.utils.data.dataset.Dataset

Emmental dataset.

An advanced dataset class for input data that contains multiple fields and output data that contains multiple label sets.

Parameters
  • name (str) – The name of the dataset.

  • X_dict (Dict[str, Any]) – The feature dict where key is the feature name and value is the feature.

  • Y_dict (Optional[Dict[str, Tensor]]) – The label dict where key is the label name and value is the label, defaults to None.

  • uid (Optional[str]) – The unique id key in the X_dict, defaults to None.
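The field-per-key layout of X_dict and Y_dict can be sketched with a small stand-in class. SketchDataset below is hypothetical and does not import Emmental; the equal-length check and the always-a-tuple return value are assumptions made for the illustration, and plain lists stand in for tensors.

```python
from typing import Any, Dict, List, Optional, Tuple


class SketchDataset:
    """Hypothetical stand-in that mimics the X_dict / Y_dict layout."""

    def __init__(
        self,
        name: str,
        X_dict: Dict[str, List[Any]],
        Y_dict: Optional[Dict[str, List[Any]]] = None,
    ) -> None:
        self.name = name
        self.X_dict = X_dict
        self.Y_dict = Y_dict if Y_dict is not None else {}
        # Assumption: every feature and label column holds one entry per example.
        columns = list(self.X_dict.values()) + list(self.Y_dict.values())
        if len({len(column) for column in columns}) > 1:
            raise ValueError("All features and labels must have the same length.")

    def __len__(self) -> int:
        return len(next(iter(self.X_dict.values())))

    def __getitem__(self, index: int) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        # One example is a pair of dicts: its features and its labels.
        x = {name: values[index] for name, values in self.X_dict.items()}
        y = {name: values[index] for name, values in self.Y_dict.items()}
        return x, y


dataset = SketchDataset(
    name="reviews",
    X_dict={"text": ["bad", "good movie", "fine"], "length": [1, 2, 1]},
    Y_dict={"label": [0, 1, 1]},
)
example = dataset[1]  # ({"text": "good movie", "length": 2}, {"label": 1})
```

Indexing returns one (features, labels) pair per example, which matches the list-of-tuples batch type that emmental_collate_fn accepts.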

add_features(X_dict)[source]

Add new features into X_dict.

Parameters

X_dict (Dict[str, Any]) – The new feature dict to add into the existing feature dict.

Return type

None

add_labels(Y_dict)[source]

Add new labels into Y_dict.

Parameters

Y_dict (Dict[str, Tensor]) – The new label dict to add into the existing label dict.

Return type

None

remove_feature(feature_name)[source]

Remove one feature from the feature dict.

Parameters

feature_name (str) – The name of the feature to remove from the feature dict.

Return type

None

remove_label(label_name)[source]

Remove one label from the label dict.

Parameters

label_name (str) – The name of the label to remove from the label dict.

Return type

None

emmental.data.emmental_collate_fn(batch, min_data_len=0, max_data_len=0)[source]

Collate function.

Parameters
  • batch (Union[List[Tuple[Dict[str, Any], Dict[str, Tensor]]], List[Dict[str, Any]]]) – The batch to collate.

  • min_data_len (int) – The minimal data sequence length, defaults to 0.

  • max_data_len (int) – The maximal data sequence length (0 means no limit), defaults to 0.

Return type

Union[Tuple[Dict[str, Any], Dict[str, Tensor]], Dict[str, Any]]

Returns

The collated batch.
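How min_data_len and max_data_len might shape a batch of variable-length sequences can be sketched as follows. This is an illustration under stated assumptions, not the library's code: pad_sequences is a hypothetical helper, the pad value of 0 and the truncation of over-long sequences are assumptions, and plain lists stand in for tensors.

```python
from typing import List


def pad_sequences(
    sequences: List[List[int]],
    min_data_len: int = 0,
    max_data_len: int = 0,
    pad_value: int = 0,
) -> List[List[int]]:
    """Pad (and possibly truncate) sequences to one common length."""
    if not sequences:
        return []
    # Target length: the longest sequence, but never shorter than min_data_len.
    target_len = max(max(len(seq) for seq in sequences), min_data_len)
    # A positive max_data_len caps the length; 0 means no limit.
    if max_data_len > 0:
        target_len = min(target_len, max_data_len)
    padded = []
    for seq in sequences:
        clipped = list(seq[:target_len])
        padded.append(clipped + [pad_value] * (target_len - len(clipped)))
    return padded


batch = [[1, 2, 3], [4]]
default_padded = pad_sequences(batch)               # [[1, 2, 3], [4, 0, 0]]
min_padded = pad_sequences(batch, min_data_len=5)   # [[1, 2, 3, 0, 0], [4, 0, 0, 0, 0]]
max_padded = pad_sequences(batch, max_data_len=2)   # [[1, 2], [4, 0]]
```

With both parameters left at 0, the batch is simply padded to its longest sequence.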

Configuration Settings

Visit the Configuring Emmental page to see how to provide configuration parameters to Emmental via .emmental-config.yaml.

The data configuration parameters are described below:

# Data configuration
data_config:
    min_data_len: 0 # min data length
    max_data_len: 0 # max data length (0 means no limit)