Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning 2.0.0 conventions. The maintainers will create a git tag for each release and increment the version number found in emmental/_version.py accordingly. We release tagged versions to PyPI automatically using GitHub Actions.

Note

Emmental is still under active development and APIs may change rapidly. Until we release v1.0.0, changes in the MINOR version indicate backward-incompatible changes.

0.1.1 - 2022-01-11

Fixed

0.1.0 - 2021-11-24

Deprecated

  • @senwu: Deprecated the active argument in the learner and loss function APIs, and deprecated the ignore_index argument in the configuration. (#107)

Fixed

  • @senwu: Fix the issue that metrics cannot be calculated when the scorer is None. (#112)

  • @senwu: Fix the Meta.config is None issue in collate_fn with num_workers > 1 when using Python 3.8+ on macOS. (#117)

Added

  • @senwu: Introduce two new classes, Action and Batch, to make the APIs more modular and to make Emmental more extensible and easier to use for downstream tasks. (#116)

Note

1. We introduce two new classes, Action and Batch, to make the APIs more modular.

  • Action objects populate the task_flow sequence. An Action has three attributes: name (the name of the action), module (the name of the module the action runs), and inputs (the inputs to the action). Introducing a class for specifying actions in the task_flow standardizes its definition. Moreover, Action gives users more flexibility in specifying a task flow, since we can now support a wider range of formats for the inputs attribute, as discussed in (2).

  • Batch is the object returned from the Emmental Scheduler (see the sketch after the action_outputs examples below). Each Batch object has six attributes: uids (uids of the samples), X_dict (input features of the samples), Y_dict (outputs of the samples), task_to_label_dict (the task-to-label mapping), data_name (name of the dataset the samples come from), and split (the split information). Defining the Batch class unifies and standardizes the training scheduler interface by ensuring a consistent output format for all schedulers.

2. We make the task_flow more flexible by supporting more formats for specifying the inputs to each module.

  • It now supports a str as inputs (e.g., inputs="input1"), which means taking input1's output as the input for the current action.

  • It also supports a list as inputs, whose elements can take three different formats:

    • x (where x is a str): takes the whole output of x as input. This enables users to pass all outputs from one module to another without having to manually specify every input to the module.

    • (x, y) (where y is an int): takes the y-th output of x as input.

    • (x, y) (where y is a str): takes the entry named y from x's output as input.

A few emmental.Action examples:

from emmental import Action as Act
Act(name="input", module="input_module0", inputs=[("_input_", "data")])
Act(name="input", module="input_module0", inputs=[("_input_", 0)])
Act(name="input", module="input_module0", inputs=["_input_"])
Act(name="input", module="input_module0", inputs="_input_")
Act(name="input", module="input_module0", inputs=[("_input_", "data"), ("_input_", 1), "_input_"])
Act(name="input", module="input_module0", inputs=None)

This design can also be applied to action_outputs; here are a few examples:

action_outputs=[(f"{task_name}_pred_head", 0), ("_input_", "data"), f"{task_name}_pred_head"]
action_outputs="_input_"
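
For illustration, here is a minimal sketch of constructing and reading a Batch. The import path and constructor signature are assumptions inferred from the attribute list above, not a documented API:

import torch

from emmental.schedulers.scheduler import Batch  # assumed import path

# Hypothetical two-sample batch illustrating the six attributes above.
batch = Batch(
    uids=["sample_0", "sample_1"],           # uids of the samples
    X_dict={"data": torch.randn(2, 2)},      # input features of the samples
    Y_dict={"label": torch.tensor([0, 1])},  # outputs (labels) of the samples
    task_to_label_dict={"Task1": "label"},   # the task-to-label mapping
    data_name="my_dataset",                  # name of the source dataset
    split="train",                           # the split information
)
print(batch.uids, batch.data_name, batch.split)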

0.0.9 - 2021-10-05

Added

  • @senwu: Support wandb logging. (#99)

  • @senwu: Fix the issue that the log writer cannot dump functions in Meta.config. (#103)

  • @senwu: Add a return_loss argument to the model's predict and forward to support cases where no loss calculation can be done or is needed. (#105)

  • @lorr1 and @senwu: Add skip_learned_data to support skipping already-learned data when resuming learning. (#101, #108)

Fixed

  • @senwu: Fix model learning so it can handle tasks that don't have a Y_dict from the dataloader, such as contrastive learning. (#105)

0.0.8 - 2021-02-14

Added

Note

To output a model's intermediate output, the user needs to specify which module outputs to return in EmmentalTask's action_outputs. It should be a pair of task_flow name and index, or a list of such pairs. During the prediction phase, the user needs to set return_action_outputs=True to get the outputs, where the key is {task_flow name}_{index}.

# Note: ce_loss, output, and task_metrics below are assumed to be
# user-defined helpers, as in the Emmental tutorials.
from functools import partial

import torch.nn as nn

from emmental.scorer import Scorer
from emmental.task import EmmentalTask

task_name = "Task1"
EmmentalTask(
    name=task_name,
    module_pool=nn.ModuleDict(
        {
            "input_module": nn.Linear(2, 8),
            f"{task_name}_pred_head": nn.Linear(8, 2),
        }
    ),
    task_flow=[
        {
            "name": "input",
            "module": "input_module",
            "inputs": [("_input_", "data")],
        },
        {
            "name": f"{task_name}_pred_head",
            "module": f"{task_name}_pred_head",
            "inputs": [("input", 0)],
        },
    ],
    loss_func=partial(ce_loss, task_name),
    output_func=partial(output, task_name),
    action_outputs=[
        (f"{task_name}_pred_head", 0),
        ("_input_", "data"),
        (f"{task_name}_pred_head", 0),
    ],
    scorer=Scorer(metrics=task_metrics[task_name]),
)
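
As a hedged usage sketch, prediction with intermediate outputs might look like the following. Here task is assumed to hold the EmmentalTask constructed above, test_dataloader is a user-built EmmentalDataLoader, and the "outputs" entry in the returned dict is an assumption about the return format:

from emmental.model import EmmentalModel

# `task` is the EmmentalTask constructed above (assigned to a variable);
# `test_dataloader` is a user-built EmmentalDataLoader (omitted here).
model = EmmentalModel(name="my_model", tasks=[task])

# With return_action_outputs=True, intermediate outputs are keyed by
# {task_flow name}_{index}, e.g. "Task1_pred_head_0".
res = model.predict(test_dataloader, return_action_outputs=True)
print(res["outputs"].keys())  # assumed key; consult the API docs
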
  • @senwu: Support action output dict. (#82)

  • @senwu: Add a new argument online_eval. If online_eval is off, the model won't return probs. (#89)

  • @senwu: Support multiple device training and inference. (#91)

Note

To train a model on multiple devices such as CPU and GPU, the user needs to specify which module is on which device in EmmentalTask's module_device. It's a dictionary with the module name as the key and the device id as the value (-1 for CPU). During the training and inference phases, Emmental will automatically perform the forward pass based on the module device information.

task_name = "Task1"
EmmentalTask(
    name=task_name,
    module_pool=nn.ModuleDict(
        {
            "input_module": nn.Linear(2, 8),
            f"{task_name}_pred_head": nn.Linear(8, 2),
        }
    ),
    task_flow=[
        {
            "name": "input",
            "module": "input_module",
            "inputs": [("_input_", "data")],
        },
        {
            "name": f"{task_name}_pred_head",
            "module": f"{task_name}_pred_head",
            "inputs": [("input", 0)],
        },
    ],
    loss_func=partial(ce_loss, task_name),
    output_func=partial(output, task_name),
    action_outputs=[
        (f"{task_name}_pred_head", 0),
        ("_input_", "data"),
        (f"{task_name}_pred_head", 0),
    ],
    # Module-to-device mapping: -1 places input_module on CPU,
    # and the prediction head on GPU 0.
    module_device={"input_module": -1, f"{task_name}_pred_head": 0},
    scorer=Scorer(metrics=task_metrics[task_name]),
)

  • @senwu: Add require_prob_for_eval and require_pred_for_eval to optimize score function performance. (#92)

Note

The current approach stores probs and preds when scoring the model, which might require a lot of memory, especially for large datasets. The score function is also used in training. To optimize score function performance, this PR introduces two new arguments in EmmentalTask, require_prob_for_eval and require_pred_for_eval, which automatically select whether return_probs or return_preds is needed.

task_name = "Task1"
EmmentalTask(
    name=task_name,
    module_pool=nn.ModuleDict(
        {
            "input_module": nn.Linear(2, 8),
            f"{task_name}_pred_head": nn.Linear(8, 2),
        }
    ),
    task_flow=[
        {
            "name": "input",
            "module": "input_module",
            "inputs": [("_input_", "data")],
        },
        {
            "name": f"{task_name}_pred_head",
            "module": f"{task_name}_pred_head",
            "inputs": [("input", 0)],
        },
    ],
    loss_func=partial(ce_loss, task_name),
    output_func=partial(output, task_name),
    action_outputs=[
        (f"{task_name}_pred_head", 0),
        ("_input_", "data"),
        (f"{task_name}_pred_head", 0),
    ],
    module_device={"input_module": -1, f"{task_name}_pred_head": 0},
    require_prob_for_eval=True,
    require_pred_for_eval=True,
    scorer=Scorer(metrics=task_metrics[task_name]),
)

  • @senwu: Support saving and loading optimizer and lr_scheduler checkpoints. (#93)

  • @senwu: Support step-based learning and add the arguments start_step and n_steps to set the starting step and total number of steps (see the sketch below). (#93)
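
Note

A minimal configuration sketch for step-based learning; the placement of these keys under learner_config is an assumption:

import emmental

emmental.init("logs")

# Assumed placement under learner_config: train for a fixed number of
# steps instead of epochs, optionally resuming from a given step.
emmental.Meta.update_config(
    config={"learner_config": {"n_steps": 10000, "start_step": 0}}
)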

Fixed

  • @senwu: Fix customized optimizer support issue. (#81)

  • @senwu: Fix loss logging that didn't account for task weights. (#93)

0.0.7 - 2020-06-03

Added

  • @senwu: Support gradient accumulation steps for machines that cannot run a large batch size (see the sketch after the parameter-group example below). (#74)

  • @senwu: Support user-specified parameter groups in the optimizer. (#74)

Note

When building the Emmental learner, the user can specify parameter groups for the optimizer using emmental.Meta.config["learner_config"]["optimizer_config"]["parameters"], which is a function that takes the model as input and outputs a list of parameter groups; otherwise the learner will create a single parameter group with all parameters in the model. Below is an example of the grouped parameters commonly used when optimizing BERT with Adam.

import emmental

# Apply L2 weight decay to all parameters except biases and LayerNorm weights.
def grouped_parameters(model):
    no_decay = ["bias", "LayerNorm.weight"]
    return [
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": emmental.Meta.config["learner_config"][
                "optimizer_config"
            ]["l2"],
        },
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]

emmental.Meta.config["learner_config"]["optimizer_config"][
    "parameters"
] = grouped_parameters
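
For the gradient accumulation entry above, a minimal sketch; the key name and its placement under optimizer_config are assumptions:

import emmental

emmental.init("logs")

# Assumed key: accumulate gradients over 4 batches to emulate a 4x
# larger effective batch size on memory-limited machines.
emmental.Meta.update_config(
    config={
        "learner_config": {
            "optimizer_config": {"gradient_accumulation_steps": 4}
        }
    }
)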

Changed

  • @senwu: Enabled "Type hints (PEP 484) support for the Sphinx autodoc extension." (#69)

  • @senwu: Refactor docstrings and enforce using flake8-docstrings. (#69)

0.0.6 - 2020-04-07

Added

  • @senwu: Support probabilistic gold labels in the scorer.

  • @senwu: Add add_tasks to support adding one or multiple tasks into the model.

  • @senwu: Add use_exact_log_path to support using the exact log path.

Note

When initializing Emmental, there is an extra argument use_exact_log_path to use the exact log path.

emmental.init(dirpath, use_exact_log_path=True)

Changed

  • @senwu: Change to run evaluation only when evaluation is triggered.

0.0.5 - 2020-03-01

Added

  • @senwu: Add checkpoint_all to control whether to save all checkpoints.

  • @senwu: Support the CosineAnnealingLR, CyclicLR, OneCycleLR, and ReduceLROnPlateau lr schedulers (see the sketch below).

  • @senwu: Add more unit tests.

  • @senwu: Support all PyTorch optimizers.

  • @senwu: Support the accuracy@k metric.

  • @senwu: Support the cosine annealing lr scheduler.
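
Note

A minimal sketch of selecting one of these schedulers through the learner config; the exact key names and the "cosine_annealing" string are assumptions:

import emmental

emmental.init("logs")

# Assumed config keys: pick a PyTorch optimizer and an lr scheduler by name.
emmental.Meta.update_config(
    config={
        "learner_config": {
            "optimizer_config": {"optimizer": "adam"},
            "lr_scheduler_config": {"lr_scheduler": "cosine_annealing"},
        }
    }
)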

Fixed

  • @senwu: Fix multiple checkpoint_metric issue.

0.0.4 - 2019-11-11

Added

  • @senwu: Log the metric dict into the log file at every evaluation trigger or full epoch.

  • @senwu: Add get_num_batches to calculate the total number of batches from all dataloaders.

  • @senwu: Add n_batches in EmmentalDataLoader and fillup in Scheduler to support customized dataloaders.

  • @senwu: Add overall and task-specific losses during evaluation as default, to support user needs for clear checkpoints.

  • @senwu: Add min_len and max_len in Meta.config to support setting sequence lengths.

  • @senwu: Calculate overall and task-specific metrics and losses during training.

  • @senwu: Add more util functions, e.g., array_to_numpy, construct_identifier, and random_string.

  • @senwu: Enforce that each dataset has a uids attribute.

  • @senwu: Add micro/macro metric options, which have split-wise micro/macro averages and global-wise micro/macro averages. The names for the metrics are:

split-wise micro average: `model/all/{split}/micro_average`
split-wise macro average: `model/all/{split}/macro_average`
global-wise micro average: `model/all/all/micro_average`
global-wise macro average: `model/all/all/macro_average`

Note: micro means averaging all metrics from all tasks; macro means averaging the average metric from all tasks.

  • @senwu: Add contrib folder to support unofficial usages.

Fixed

  • @senwu: Correct the lr update for epoch-wise schedulers.

  • @senwu: Add types for classes.

  • @senwu: Add a warning for single-class input in the ROC AUC metric.

  • @senwu: Fix missing support for the StepLR and MultiStepLR lr schedulers.

  • @senwu: Fix the missing pytest.ini and fix the issue that tests cannot remove the temp dir.

  • @senwu: Fix the default train loss metric from model/train/all/loss to model/all/train/loss to follow the TASK_NAME/DATA_NAME/SPLIT/METRIC pattern.

Changed

  • @senwu: Change default grad clip to None.

  • @senwu: Make seed and grad_clip nullable.

  • @senwu: Change the default class indexing to 0-indexed.

  • @senwu: Change default ignore_index to None.

  • @senwu: Change the default counter unit to epoch.

  • @senwu: Update the metric to return one metric value by default.

Removed

  • @senwu: Remove checkpoint_clear argument.