Dataset and Dataloader
The first step in Emmental’s pipeline is to turn user-provided data into an Emmental Dataset and Dataloader.
Dataset and Dataloader Classes
The following docs describe elements of Emmental’s Dataset and Dataloader.
Emmental dataset and dataloader.
- class emmental.data.EmmentalDataLoader(task_to_label_dict, dataset, split='train', collate_fn=None, n_batches=None, **kwargs)
Bases: torch.utils.data.dataloader.DataLoader
Emmental dataloader.
An advanced dataloader class that carries two extra pieces of information: a mapping from each task to its label (which label(s) in the dataset’s Y_dict to use for that task) and the split (which part of the data this dataset belongs to).
- Parameters
  task_to_label_dict (Dict[str, str]) – The task to label mapping where key is the task name and value is the label(s) for that task and should be the key in Y_dict.
  dataset (EmmentalDataset) – The dataset to construct the dataloader.
  split (str) – The split information, defaults to “train”.
  collate_fn (Optional[Callable]) – The function that merges a list of samples to form a mini-batch, defaults to emmental_collate_fn.
  n_batches (Optional[int]) – Total number of batches.
  **kwargs – Other arguments of dataloader.
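The role of task_to_label_dict can be illustrated with plain dictionaries. This is a minimal sketch, not Emmental code: the task names, label names, and the helper labels_for_task are hypothetical, and plain lists stand in for tensors so the example has no dependencies.

```python
from typing import Any, Dict

# Hypothetical label dict for one batch: label name -> labels.
# Plain lists are used here in place of tensors.
Y_batch: Dict[str, Any] = {
    "sentiment_labels": [1, 0, 1],
    "topic_labels": [2, 2, 0],
}

# Task name -> key in Y_dict, in the shape documented for task_to_label_dict.
task_to_label_dict: Dict[str, str] = {
    "sentiment_task": "sentiment_labels",
    "topic_task": "topic_labels",
}


def labels_for_task(task_name: str) -> Any:
    """Return the labels a task should use (illustrative helper, not library API)."""
    label_name = task_to_label_dict[task_name]
    return Y_batch[label_name]


sentiment_targets = labels_for_task("sentiment_task")
topic_targets = labels_for_task("topic_task")
```

Keeping the mapping separate from the data means several tasks can read different label sets from the same dataset.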
- class emmental.data.EmmentalDataset(name, X_dict, Y_dict=None, uid=None)
Bases: torch.utils.data.dataset.Dataset
Emmental dataset.
An advanced dataset class for input data that contains multiple fields and output data that contains multiple label sets.
- Parameters
  name (str) – The name of the dataset.
  X_dict (Dict[str, Any]) – The feature dict where key is the feature name and value is the feature.
  Y_dict (Optional[Dict[str, Tensor]]) – The label dict where key is the label name and value is the label, defaults to None.
  uid (Optional[str]) – The unique id key in the X_dict, defaults to None.
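The multi-field, multi-label layout described above can be sketched with a small stand-in class. ToyMultiFieldDataset is hypothetical and is not EmmentalDataset: it only mirrors the documented constructor arguments, uses plain lists instead of tensors, and assumes that every feature and label set holds one entry per example and that indexing returns a feature dict and a label dict for one example.

```python
from typing import Any, Dict, Optional, Tuple


class ToyMultiFieldDataset:
    """Hypothetical stand-in for a dataset with several fields and label sets."""

    def __init__(
        self,
        name: str,
        X_dict: Dict[str, Any],
        Y_dict: Optional[Dict[str, Any]] = None,
        uid: Optional[str] = None,
    ) -> None:
        self.name = name
        self.X_dict = X_dict
        self.Y_dict = Y_dict if Y_dict is not None else {}
        self.uid = uid
        # Assumption: all fields and label sets describe the same examples,
        # so they must all have the same length.
        columns = list(self.X_dict.values()) + list(self.Y_dict.values())
        if len({len(column) for column in columns}) > 1:
            raise ValueError("All features and labels must have the same length.")
        # Assumption: when given, uid must name one of the feature fields.
        if uid is not None and uid not in self.X_dict:
            raise ValueError(f"uid {uid!r} is not a key of X_dict.")

    def __len__(self) -> int:
        return len(next(iter(self.X_dict.values())))

    def __getitem__(self, index: int) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        x = {field: values[index] for field, values in self.X_dict.items()}
        y = {label: values[index] for label, values in self.Y_dict.items()}
        return x, y


dataset = ToyMultiFieldDataset(
    name="reviews",
    X_dict={"ids": ["r1", "r2"], "tokens": [[3, 7, 9], [4, 2]]},
    Y_dict={"sentiment_labels": [1, 0]},
    uid="ids",
)
first_x, first_y = dataset[0]
```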
- add_features(X_dict)
Add new features into X_dict.
- Parameters
  X_dict (Dict[str, Any]) – The new feature dict to add into the existing feature dict.
- Return type
  None
- add_labels(Y_dict)
Add new labels into Y_dict.
- Parameters
  Y_dict (Dict[str, Tensor]) – The new label dict to add into the existing label dict.
- Return type
  None
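Conceptually, add_features and add_labels extend the dataset’s dictionaries in place and return nothing. The sketch below shows that idea with plain dictionaries. The functions are illustrative stand-ins rather than the library’s methods, and overwriting an entry that already exists under the same name is an assumption of this sketch, since the behavior on name collisions is not stated here.

```python
from typing import Any, Dict

# Existing feature and label dicts (plain lists in place of tensors).
X_dict: Dict[str, Any] = {"tokens": [[3, 7, 9], [4, 2]]}
Y_dict: Dict[str, Any] = {"sentiment_labels": [1, 0]}


def add_features(existing: Dict[str, Any], new_features: Dict[str, Any]) -> None:
    """Merge new features into the existing feature dict in place."""
    # Assumption: an entry with the same name is overwritten.
    existing.update(new_features)


def add_labels(existing: Dict[str, Any], new_labels: Dict[str, Any]) -> None:
    """Merge new labels into the existing label dict in place."""
    existing.update(new_labels)


# Add a derived feature and a second label set for the same two examples.
add_features(X_dict, {"lengths": [3, 2]})
add_labels(Y_dict, {"topic_labels": [2, 0]})
```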
- emmental.data.emmental_collate_fn(batch, min_data_len=0, max_data_len=0)
Collate function.
- Parameters
  batch (Union[List[Tuple[Dict[str, Any], Dict[str, Tensor]]], List[Dict[str, Any]]]) – The batch to collate.
  min_data_len (int) – The minimal data sequence length, defaults to 0.
  max_data_len (int) – The maximal data sequence length (0 means no limit), defaults to 0.
- Return type
  Union[Tuple[Dict[str, Any], Dict[str, Tensor]], Dict[str, Any]]
- Returns
  The collated batch.
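One way to read the min_data_len and max_data_len parameters is as bounds on the padded sequence length within a batch. The sketch below is a simplified illustration under that assumption, not the implementation of emmental_collate_fn: toy_collate, pad_sequences, and the padding value PAD are hypothetical, plain lists replace tensors, and only the (features, labels) tuple form of the batch is handled.

```python
from typing import Any, Dict, List, Tuple

PAD = 0  # hypothetical padding value


def pad_sequences(
    seqs: List[List[int]], min_data_len: int = 0, max_data_len: int = 0
) -> List[List[int]]:
    """Pad or truncate all sequences to one shared length."""
    # Pad to the longest sequence in the batch, but at least min_data_len.
    target = max(max(len(seq) for seq in seqs), min_data_len)
    # A positive max_data_len caps the length; 0 means no limit.
    if max_data_len > 0:
        target = min(target, max_data_len)
    return [seq[:target] + [PAD] * (target - len(seq[:target])) for seq in seqs]


def toy_collate(
    batch: List[Tuple[Dict[str, Any], Dict[str, Any]]],
    min_data_len: int = 0,
    max_data_len: int = 0,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Merge per-example dicts into one feature dict and one label dict."""
    X_batch: Dict[str, Any] = {}
    Y_batch: Dict[str, Any] = {}
    for x_dict, y_dict in batch:
        for field, value in x_dict.items():
            X_batch.setdefault(field, []).append(value)
        for label, value in y_dict.items():
            Y_batch.setdefault(label, []).append(value)
    # Only list-valued fields are treated as sequences that need padding.
    for field in list(X_batch):
        if all(isinstance(value, list) for value in X_batch[field]):
            X_batch[field] = pad_sequences(X_batch[field], min_data_len, max_data_len)
    return X_batch, Y_batch


batch = [
    ({"tokens": [3, 7, 9]}, {"sentiment_labels": 1}),
    ({"tokens": [4, 2]}, {"sentiment_labels": 0}),
]
X_default, Y_default = toy_collate(batch)
X_capped, _ = toy_collate(batch, max_data_len=2)
X_floored, _ = toy_collate(batch, min_data_len=5)
```

With the defaults, the shorter sequence is padded to the longest one in the batch. Setting max_data_len=2 truncates both sequences to two items, and min_data_len=5 pads both to five.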
Configuration Settings
Visit the Configuring Emmental page to see how to provide configuration parameters to Emmental via .emmental-config.yaml.
The data parameters are described below:
# Data configuration
data_config:
    min_data_len: 0 # min data length
    max_data_len: 0 # max data length (e.g., 0 for no max_len)