Dataset and Dataloader

The first step in Emmental’s pipeline is to create an Emmental Dataset and Dataloader from user-provided data.

Dataset and Dataloader Classes

The following docs describe elements of Emmental’s Dataset and Dataloader.

Emmental dataset and dataloader.

class emmental.data.EmmentalDataLoader(task_to_label_dict, dataset, split='train', collate_fn=None, n_batches=None, **kwargs)

Bases: torch.utils.data.dataloader.DataLoader

Emmental dataloader.

A dataloader class that also stores a mapping from each task to its label (which label(s) in the dataset’s Y_dict to use for that task) and the split (which part of the data this dataset belongs to).

Parameters

task_to_label_dict (Dict[str, str]) – The task-to-label mapping, where each key is a task name and each value is the label(s) for that task. Each value must be a key in Y_dict.
dataset (EmmentalDataset) – The dataset from which to construct the dataloader.
split (str) – The split information, defaults to “train”.
collate_fn (Optional[Callable]) – The function that merges a list of samples to form a mini-batch, defaults to emmental_collate_fn.
n_batches (Optional[int]) – Total number of batches.
**kwargs – Other arguments passed to the dataloader.
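
The role of task_to_label_dict can be illustrated without Emmental itself. The following is a minimal plain-Python sketch, not Emmental’s implementation: `select_labels` is a hypothetical helper, the task and label names are made up, and plain lists stand in for tensors. It only shows how each task name resolves to a key in Y_dict.

```python
from typing import Any, Dict, List


def select_labels(
    task_to_label_dict: Dict[str, str], Y_dict: Dict[str, List[Any]]
) -> Dict[str, List[Any]]:
    """Return the labels each task should use (hypothetical helper)."""
    selected = {}
    for task_name, label_name in task_to_label_dict.items():
        # Each value of the mapping must be a key in Y_dict.
        if label_name not in Y_dict:
            raise KeyError(
                f"Label {label_name!r} for task {task_name!r} is not in Y_dict"
            )
        selected[task_name] = Y_dict[label_name]
    return selected


# Two label sets over the same three examples (lists stand in for tensors).
Y_dict = {"sentiment_label": [0, 1, 1], "topic_label": [2, 0, 1]}
task_to_label_dict = {
    "sentiment_task": "sentiment_label",
    "topic_task": "topic_label",
}

labels_by_task = select_labels(task_to_label_dict, Y_dict)
```

Keeping the mapping separate from the dataset lets several tasks share one dataset while each reads its own label set.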

class emmental.data.EmmentalDataset(name, X_dict, Y_dict=None, uid=None)

Bases: torch.utils.data.dataset.Dataset

Emmental dataset.

A dataset class for input data that contains multiple fields and output data that contains multiple label sets.

Parameters

name (str) – The name of the dataset.
X_dict (Dict[str, Any]) – The feature dict where key is the feature name and value is the feature.
Y_dict (Optional[Dict[str, Tensor]]) – The label dict where key is the label name and value is the label, defaults to None.
uid (Optional[str]) – The unique id key in the X_dict, defaults to None.

add_features(X_dict)

Add new features into X_dict.

Parameters: X_dict (Dict[str, Any]) – The new feature dict to add into the existing feature dict.
Return type: None

add_labels(Y_dict)

Add new labels into Y_dict.

Parameters: Y_dict (Dict[str, Tensor]) – The new label dict to add into the existing label dict.
Return type: None

remove_feature(feature_name)

Remove one feature from feature dict.

Parameters: feature_name (str) – The name of the feature to remove from the feature dict.
Return type: None

remove_label(label_name)

Remove one label from label dict.

Parameters: label_name (str) – The name of the label to remove from the label dict.
Return type: None
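
The dict-based layout and the four mutation methods above can be sketched with a small stand-in class. This is an illustration under stated assumptions, not Emmental’s code: `ToyDataset` is hypothetical, it stores plain lists instead of tensors, it assumes every field holds one entry per example, and its per-item return shape (a feature dict paired with a label dict) is inferred from the batch type accepted by the collate function.

```python
from typing import Any, Dict, List, Optional, Tuple


class ToyDataset:
    """Hypothetical stand-in mirroring the documented interface."""

    def __init__(
        self,
        name: str,
        X_dict: Dict[str, List[Any]],
        Y_dict: Optional[Dict[str, List[Any]]] = None,
    ) -> None:
        self.name = name
        self.X_dict = dict(X_dict)
        self.Y_dict = dict(Y_dict or {})

    def __len__(self) -> int:
        # Assumption: every field holds one entry per example.
        return len(next(iter(self.X_dict.values())))

    def __getitem__(self, index: int) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        # One example is a (features, labels) pair of dicts.
        x = {key: values[index] for key, values in self.X_dict.items()}
        y = {key: values[index] for key, values in self.Y_dict.items()}
        return x, y

    def add_features(self, X_dict: Dict[str, List[Any]]) -> None:
        self.X_dict.update(X_dict)

    def add_labels(self, Y_dict: Dict[str, List[Any]]) -> None:
        self.Y_dict.update(Y_dict)

    def remove_feature(self, feature_name: str) -> None:
        del self.X_dict[feature_name]

    def remove_label(self, label_name: str) -> None:
        del self.Y_dict[label_name]


dataset = ToyDataset(
    "toy",
    X_dict={"tokens": [[1, 2], [3]], "length": [2, 1]},
    Y_dict={"label": [0, 1]},
)
dataset.add_features({"source": ["a", "b"]})
dataset.remove_feature("length")
first_x, first_y = dataset[0]
```

After these calls, each example carries the `tokens` and `source` fields but no longer `length`, while the label dict is unchanged.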

emmental.data.emmental_collate_fn(batch, min_data_len=0, max_data_len=0)

Collate function.

Parameters

batch (Union[List[Tuple[Dict[str, Any], Dict[str, Tensor]]], List[Dict[str, Any]]]) – The batch to collate.
min_data_len (int) – The minimal data sequence length, defaults to 0.
max_data_len (int) – The maximal data sequence length (0 means no limit), defaults to 0.

Return type

Union[Tuple[Dict[str, Any], Dict[str, Tensor]], Dict[str, Any]]

Returns

The collated batch.
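
The reference above does not spell out how min_data_len and max_data_len are applied. One plausible reading, sketched below in plain Python, is that variable-length sequences in a batch are padded to a common length that is capped by max_data_len and raised to at least min_data_len. Everything here is an assumption for illustration: `pad_or_truncate` and `toy_collate` are hypothetical names, the padding value 0 is arbitrary, and lists stand in for tensors.

```python
from typing import Any, Dict, List, Tuple


def pad_or_truncate(
    sequences: List[List[Any]],
    min_data_len: int = 0,
    max_data_len: int = 0,
    pad_value: Any = 0,
) -> List[List[Any]]:
    """Bring all sequences to one common length (hypothetical helper)."""
    target_len = max(len(seq) for seq in sequences)
    if max_data_len > 0:  # 0 means no upper limit
        target_len = min(target_len, max_data_len)
    # The minimum is applied last, so it wins if the two settings conflict.
    target_len = max(target_len, min_data_len)
    padded = []
    for seq in sequences:
        clipped = list(seq[:target_len])
        padded.append(clipped + [pad_value] * (target_len - len(clipped)))
    return padded


def toy_collate(
    batch: List[Tuple[Dict[str, Any], Dict[str, Any]]],
    min_data_len: int = 0,
    max_data_len: int = 0,
) -> Tuple[Dict[str, List[Any]], Dict[str, List[Any]]]:
    """Merge per-example (features, labels) pairs into batched dicts."""
    x_batch: Dict[str, List[Any]] = {}
    y_batch: Dict[str, List[Any]] = {}
    for x_dict, y_dict in batch:
        for key, value in x_dict.items():
            x_batch.setdefault(key, []).append(value)
        for key, value in y_dict.items():
            y_batch.setdefault(key, []).append(value)
    # Only list-valued feature fields are padded or truncated.
    for key, values in list(x_batch.items()):
        if all(isinstance(value, list) for value in values):
            x_batch[key] = pad_or_truncate(values, min_data_len, max_data_len)
    return x_batch, y_batch


batch = [
    ({"tokens": [1, 2, 3, 4, 5]}, {"label": 1}),
    ({"tokens": [6, 7]}, {"label": 0}),
]
x_batch, y_batch = toy_collate(batch, max_data_len=4)
```

With max_data_len=4, the longer sequence is cut to four items and the shorter one is padded to the same length, so every field in the batch has a uniform shape.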

Configuration Settings

Visit the Configuring Emmental page to see how to provide configuration parameters to Emmental via .emmental-config.yaml.

The data parameters are described below:

# Data configuration
data_config:
    min_data_len: 0 # min data length
    max_data_len: 0 # max data length (0 means no limit)