EMNLP 2020 Notes

A look at new papers from EMNLP 2020, introducing each one from a problem/solution/result perspective.

Eric Lam
20 min read · Jul 13, 2021

Machine Learning for NLP

Seq2Edits: Sequence Transduction Using Span-level Edit Operations

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.418
Problem: speed up seq2seq generation by predicting edits instead of generating from scratch.
Solution:
When the input and target overlap heavily, the target can be produced with span-level replace/delete/copy operations instead of generating every token.
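As a rough illustration (my own sketch, not the paper's exact tag set), span-level edits can be applied like this:

```python
# Minimal sketch of span-level edits (illustrative, not the paper's exact format).
# Each edit is (operation, source_span, replacement): COPY keeps the source span,
# REPLACE substitutes new tokens, DELETE drops the span.

def apply_span_edits(source_tokens, edits):
    output = []
    for op, (start, end), replacement in edits:
        if op == "COPY":
            output.extend(source_tokens[start:end])
        elif op == "REPLACE":
            output.extend(replacement)
        elif op == "DELETE":
            pass  # drop the span
    return output

source = "He go to school yesterday .".split()
edits = [
    ("COPY", (0, 1), None),         # "He"
    ("REPLACE", (1, 2), ["went"]),  # "go" -> "went"
    ("COPY", (2, 6), None),         # "to school yesterday ."
]
print(apply_span_edits(source, edits))
# ['He', 'went', 'to', 'school', 'yesterday', '.']
```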

Result:

PatchBERT: Just-in-Time, Out-of-Vocabulary Patching

Paper:
Problem: improve out-of-vocabulary (OOV) performance

Solution:
Map each OOV token onto existing subwords already in the vocabulary, so the pretrained model can still be used.
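As a rough illustration of the idea (my own sketch, not the paper's algorithm), an OOV word can be patched by greedily matching the longest subwords that already exist in the vocabulary:

```python
# Rough sketch: greedily split an OOV word into subwords that already exist in the
# vocabulary (WordPiece-style longest match), so pretrained embeddings can be reused.

def patch_oov(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:   # no known subword covers this character
            return None    # cannot patch this word
        start = end
    return pieces

vocab = {"micro", "##scope", "##s"}
print(patch_oov("microscopes", vocab))  # ['micro', '##scope', '##s']
```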

Result:

Pre-Training Transformers as Energy-Based Cloze Models

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.20
Problem: a more effective pretraining objective

Solution:

How does this relate to ELECTRA?

ELECTRA can be viewed as a variant of Electric using negative sampling instead of noise-contrastive estimation.

Result:

Sequence-Level Mixed Sample Data Augmentation

Paper:
Problem: capture systematic compositionality

Solution:

Result:

Autoregressive Knowledge Distillation through Imitation Learning

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.494
Problem: compression of autoregressive models.
Training input -> comes from the teacher's output.
Inference input -> comes from the student's own previous output.
This training-inference inconsistency causes a decrease in generation quality.

Solution:

  • The student model must be trained on its own state distribution so that it performs better at generation.
  • The teacher model should play the role of the oracle and correct the student's generations at each time step.
    The training set is populated with new data generated from the oracle-learner mixture, and the policy learner is re-trained on the aggregated dataset at each iteration (see the sketch below).
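A minimal sketch of that aggregation loop (placeholder models, not the paper's exact algorithm):

```python
# Sketch: roll out prefixes from an oracle-learner mixture, record the teacher's
# (oracle's) token as the target at every step, and re-train the student on the
# aggregated dataset each iteration. ToyLM is a stand-in for a real model.
import random

class ToyLM:
    """Placeholder for an autoregressive model."""
    def __init__(self, vocab):
        self.vocab = vocab
    def sample_next(self, prefix):
        return random.choice(self.vocab)
    def fit(self, dataset):
        pass  # real version: maximize likelihood of the oracle tokens

def imitation_distillation(teacher, student, prompts, n_iters=3, beta=0.5, max_len=10):
    dataset = []
    for _ in range(n_iters):
        for prompt in prompts:
            prefix = list(prompt)
            for _ in range(max_len):
                # the teacher acts as the oracle: its token is the training target
                dataset.append((tuple(prefix), teacher.sample_next(prefix)))
                # continue the rollout from the oracle-learner mixture
                actor = teacher if random.random() < beta else student
                prefix.append(actor.sample_next(prefix))
        student.fit(dataset)  # re-train on the aggregated dataset
    return student

vocab = ["the", "cat", "sat", "<eos>"]
imitation_distillation(ToyLM(vocab), ToyLM(vocab), prompts=[["<bos>"]])
```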

Result:

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.634
Problem: fine-tune with less forgetting

Solution:

Result:

On Losses for Modern Language Models

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.403
Problem: investigate pre-training tasks to find better ways to pre-train.
Solution:

Result:

Semantic Label Smoothing for Sequence to Sequence Problems

Paper:
Problem: apply label smoothing to seq2seq tasks

Solution:

Result:

Embedding Words in Non-Vector Space with Unsupervised Graph Learning

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.594
Problem: embed words as nodes of a graph rather than as vectors, to capture hierarchical structure.
Solution:
The vertices (words), edges, and edge weights are all learned.
Step by step:

Objective — the distance between two nodes
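As a rough illustration only (a hand-made graph, not the paper's learned one), "distance between two nodes" can be read as a shortest-path distance over a weighted word graph:

```python
# Illustration: word distance as a shortest-path length over a weighted graph of
# words, instead of a dot product between vectors. This toy graph is hand-made;
# in the paper the vertices, edges, and weights are learned.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("animal", "dog", 1.0),
    ("animal", "cat", 1.0),
    ("dog", "poodle", 0.5),
    ("cat", "siamese", 0.5),
])

def word_distance(w1, w2):
    return nx.shortest_path_length(G, w1, w2, weight="weight")

print(word_distance("poodle", "siamese"))  # 3.0 (poodle -> dog -> animal -> cat -> siamese)
```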

Result:

Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.493
Problem: find an optimal masking policy for further pre-training that brings better task-specific results.
Solution:
Neural Mask Generator — use deep RL to find the masking scheme that best boosts pre-training performance

Result:

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.36
Problem: knowledge distillation on intermediate representations does not consider the relations between layers

Solution:
Average-pool the intermediate representations.

Use a contrastive loss to pull the teacher and student representations together; representations of other input examples serve as negative samples.

A memory bank handles the huge number of negative samples.
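A hedged PyTorch sketch of that contrastive objective (InfoNCE-style; the dimensions, temperature, and memory-bank size are placeholder choices, not the paper's values):

```python
# Sketch: pull the student's pooled representation toward the teacher's for the
# same input, push it away from teacher representations of other inputs stored in
# a memory bank. All sizes here are placeholders.
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_repr, teacher_repr, memory_bank, temperature=0.07):
    # student_repr, teacher_repr: (batch, dim) pooled intermediate representations
    # memory_bank: (num_negatives, dim) representations of other inputs
    s = F.normalize(student_repr, dim=-1)
    t = F.normalize(teacher_repr, dim=-1)
    neg = F.normalize(memory_bank, dim=-1)

    pos_logits = (s * t).sum(dim=-1, keepdim=True)      # (batch, 1)
    neg_logits = s @ neg.t()                            # (batch, num_negatives)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    labels = torch.zeros(s.size(0), dtype=torch.long)   # the positive sits at index 0
    return F.cross_entropy(logits, labels)

student = torch.randn(4, 128)   # average-pooled student layer representations
teacher = torch.randn(4, 128)   # average-pooled teacher layer representations
bank = torch.randn(1024, 128)   # memory bank of negatives
print(contrastive_kd_loss(student, teacher, bank))
```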

Result:

Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.17
Problem: multi-head attention may suffer from attention collapse
Solution:
Sample each head's parameters from a distribution.

Different heads then match different parts of the target distribution.

Result:
Code: https://github.com/bangann/Repulsive-Attention

Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.38
Problem: Lack of data in fine-tuning
Solution:
Self-supervised meta-learning tasks

Result:

Lifelong Language Knowledge Distillation

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.233
Problem: catastrophic forgetting problem in lifelong learning
Solution:

Apply knowledge distillation (KD) on the language model

Result:

BAE: BERT-based Adversarial Examples for Text Classification

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.498
Problem: Adversarial attack on classification

Solution:

Result:

Grounded Compositional Outputs for Adaptive Language Modeling

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.96
Problem: word-level language model with a size that does not depend on the training vocabulary
Solution:

Result:

Language Generation

Plan ahead: Self-Supervised Text Planning for Paragraph Completion Task

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.529
Problem: Paragraph level MLM does not focus on topical content.
Solution:
Keywords guide generation

Result:

Augmented Natural Language for Generative Sequence Labeling

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.27
Problem: improve low-resource slot labeling by jointly performing sequence labeling and sentence-level classification.
Solution:
Seq2Seq generative model

Result:

A* Beam Search

Paper: not provided
Problem: Speed up beam search
Solution:
Best-first beam search: only expand the hypothesis with the best score at each step (e.g., pop the best hypothesis at score -0.4, then -0.6, then -1.5, and so on), until a hypothesis ending in EOS is output.
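A minimal sketch of best-first search with a priority queue (a toy scoring function stands in for the real model; the paper's pruning details are omitted):

```python
# Sketch: keep partial hypotheses in a priority queue and expand only the one with
# the best score at each step, stopping when a completed (EOS) hypothesis is popped.
import heapq
import math

def next_token_logprobs(prefix):
    # placeholder model: uniform distribution over a tiny vocabulary
    vocab = ["a", "b", "<eos>"]
    return {tok: math.log(1.0 / len(vocab)) for tok in vocab}

def best_first_search(max_expansions=100):
    heap = [(0.0, ["<bos>"])]  # (negative log-prob, hypothesis); smallest pops first
    for _ in range(max_expansions):
        neg_score, hyp = heapq.heappop(heap)
        if hyp[-1] == "<eos>":
            return hyp, -neg_score  # the first completed pop is the best hypothesis
        for tok, logp in next_token_logprobs(hyp).items():
            heapq.heappush(heap, (neg_score - logp, hyp + [tok]))
    return None

print(best_first_search())
```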

Result:

COD3S: Diverse Generation with Discrete Semantic Signatures

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.421
Problem: Diverse sentence generation — same input, different semantic output
Solution:

Semantic embedding — SBERT.
LSH compresses the embedding into an x-bit signature.

The number of bits controls how much of the original embedding's similarity information is captured (it approximately preserves cosine similarity).

The decoding process conditions on both the LSH signature and the input tokens.
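A hedged sketch of the bit-signature step using random-hyperplane (SimHash-style) LSH, which approximately preserves cosine similarity; a random vector stands in for the SBERT embedding:

```python
# Sketch: compress an embedding into an n-bit signature with random hyperplanes.
# Embeddings with high cosine similarity agree on most bits. The embedding here is
# random; in the paper it would come from SBERT.
import numpy as np

rng = np.random.default_rng(0)

def lsh_signature(embedding, hyperplanes):
    return (hyperplanes @ embedding > 0).astype(int)  # one bit per hyperplane

dim, n_bits = 768, 16
hyperplanes = rng.standard_normal((n_bits, dim))

emb = rng.standard_normal(dim)
similar = emb + 0.1 * rng.standard_normal(dim)   # a nearby embedding
print(lsh_signature(emb, hyperplanes))
print(lsh_signature(similar, hyperplanes))       # agrees on most bits
```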

Result:

Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.58
Problem: make a left-to-right language model take future context into account.
Solution: use off-the-shelf, left-to-right language models with no supervision.
Regular (forward) decoding.

Decode toward the future input z and backpropagate to obtain a gradient-updated vector.

Mix both vectors and sample a new result.

Repeat.

Result:

Reformulating Unsupervised Style Transfer as Paraphrase Generation

Paper: http://style.cs.umass.edu
Problem: improve semantic preservation in unsupervised style transfer
Solution:
Reformulate style transfer as a controlled paraphrase generation task.

Diverse paraphraser — generate a paraphrase of each styled sentence using a language model (GPT).
This process normalizes away each sentence's style — the language model learns a general way to generate sentences.

Inverse paraphraser — convert the paraphrased sentence back into the original style.
Swap in a different style's inverse paraphraser to perform style transfer!

Result:

Unsupervised Text Style Transfer with Padded Masked Language Models

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.699
Problem: style transfer without parallel data
Solution:

Result:

Dialog and Interactive Systems

Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems

Paper: https://virtual.2020.emnlp.org/paper_main.2281.html
Problem: reduce the cost of evaluating bots with humans.
Solution:
Rank a set of bots, determined by bot-bot conversations instead of bot-human ones.
Humans only decide whether each conversation participant is a bot or a human.
Longer conversations expose bots more readily.

Result:

Dialogue Response Ranking Training with Large-Scale Human Feedback Data

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.28/
Problem: improve dialogue response ranking with large-scale human feedback
Solution:

Result:

Resources

More Bang for Your Buck: Natural Perturbation for Robust Question Answering

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.12/

{ "cluster-id" : "25938" , "question_id" : "267" , "is_seed_question" : 0 , "split" : "train" , "passage" : "(Thanksgiving (United States)) Thanksgiving, or Thanksgiving Day, is a public holiday celebrated on the fourth Thursday of November in the United States. It originated as a harvest festival. Thanksgiving has been celebrated nationally on and off since 1789, after Congress requested a proclamation by George Washington. It has been celebrated as a federal holiday every year since 1863, when, during the American Civil War, President Abraham Lincoln proclaimed a national day of ``Thanksgiving and Praise to our beneficent Father who dwelleth in the Heavens,'' to be celebrated on the last Thursday in November. Together with Christmas and the New Year, Thanksgiving is a part of the broader fall/winter holiday season in the U.S." , "question" : "is thanksgiving sometimes the last thursday of the month?" , "hard_label" : "True" , "soft_label" : 0.75 , "roberta_hard" : true , "ind_human_label" : "?" }

Dataset: https://github.com/allenai/natural-perturbations

SubjQA: A Dataset for Subjectivity and Review Comprehension

Paper: https://virtual.2020.emnlp.org/paper_main.595.html

domain , question , review , human_ans_spans , human_ans_indices , question_subj_level , ques_subj_score , is_ques_subjective , answer_subj_level , ans_subj_score , is_ans_subjective , nn_mod , nn_asp , query_mod , query_asp , item_id , review_id , q_review_id , q_reviews_id electronics , How well does the speaker work for you? , "To those who think these speakers are too quiet, I suggest you check your ""Sound"" settings (in Mac). Go to ""Output"" and check whether ""Logitech"" is selected, or internal speakers. I'm sure Windows instructions are similar -- Control Panel, Sound, something.If that doesn't work, turn your hearing aid up.This thing is incredibly designed, they should win an award for it. I think it was designed around the 13"" Macbook Pro (that's what I have). It nails at least 7 of Dieter Rams' 10 Principles of design, most especially ""Good design is unobtrusive.""My only complaint is that the sound mix is a little mid-heavy for my taste. That's pretty nitpicking for $40 speakers that hang like a ninja behind my laptop monitor, but I thought I'd share that observation as well. ANSWERNOTFOUND" , , speakers are too quiet , "(25, 47)" , 1 , 0.0 , False , 1 , 0.3333333333333333 , False , quiet , speaker , bad , speaker , B003VAK1I2 , 7a6efb45bd32de268c3a7868d313da0a , d38214a266310836090f4e49bc9f6dbb , e9865435e082e8d4f2cad9ecefa685c4

Dataset: https://github.com/megagonlabs/SubjQA

Learning to Explain: Datasets and Models for Identifying Valid Reasoning Chains in Multihop Question-Answering

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.10/

Dataset: https://allenai.org/data/eqasc

STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.525

Dataset: https://storium.cs.umass.edu

GLUCOSE: GeneraLized and COntextualized Story Explanations

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.370

Dataset: https://tinyurl.com/yyeo92pt

NLP Applications

Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.581/
Problem: Improve the efficiency of Grammatical Error Correction (GEC)
Solution:

Result:

Information Extraction

Counterfactual Generator: A Weakly-Supervised Method for Named Entity Recognition

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.590/
Problem: improve few-shot NER by data augmentation
Solution:

Result:

Semantics: Lexical Semantics

BERT Knows Punta Cana is not just beautiful, it’s gorgeous: Ranking Scalar Adjectives with Contextualised Representations

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.598/
Problem: ranking scalar adjectives by intensity

Solution:

Result:

Digital Voicing of Silent Speech

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.445/
Problem:
Convert silent-speech EMG signals to vocalized audio

Solution:

Two tricks: alignment and canonical correlation analysis (CCA).
Alignment: E's (silent EMG), E'v (vocalized EMG), A'v (vocalized audio).

Canonical correlation analysis (CCA): project the EMG and audio features into a shared space where they are maximally correlated.
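A small hedged sketch of the CCA step with scikit-learn (the feature dimensions and data are random placeholders, not the paper's EMG/audio features):

```python
# Sketch: project EMG features and audio features into a shared space where their
# correlation is maximal. Real inputs would be aligned EMG/audio frames; random
# arrays are used here as placeholders.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
emg_feats = rng.standard_normal((500, 112))    # frames x EMG feature dim (placeholder)
audio_feats = rng.standard_normal((500, 80))   # frames x audio feature dim (placeholder)

cca = CCA(n_components=8)
cca.fit(emg_feats, audio_feats)
emg_proj, audio_proj = cca.transform(emg_feats, audio_feats)
print(emg_proj.shape, audio_proj.shape)        # (500, 8) (500, 8)
```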

Result:

Interpretability and Analysis of Models for NLP

Pretrained Language Model Embryology: The Birth of ALBERT

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.553
Problem: investigate how an ALBERT model develops from the start of pre-training.
Solution:
Checkpoint ALBERT every N parameter-update steps during the pretraining phase and study what it has learned and what it can achieve so far:

the development of token prediction and reconstruction

world knowledge

downstream task performance

Result:

What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.555
Problem: the role of position embeddings in Transformer models.

Solution:

Absolute position regression — can the absolute position be recovered from the embedding?
Relative position regression — can token order be recovered from the embedding?

The relationship is unclear for the BERT and RoBERTa models.
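A hedged sketch of the absolute-position probe (random vectors stand in for a pretrained model's position-embedding matrix, and a proper train/test split is omitted):

```python
# Sketch: regress the position index from each position-embedding vector. A good
# fit means the embedding encodes absolute position. The embeddings here are
# random placeholders; a real probe would load a pretrained position matrix and
# evaluate on held-out positions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
num_positions, dim = 512, 768
position_embeddings = rng.standard_normal((num_positions, dim))  # placeholder
positions = np.arange(num_positions)

probe = LinearRegression().fit(position_embeddings, positions)
print("R^2:", probe.score(position_embeddings, positions))
```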

Result:

Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.557/
Problem: does BERT have common sense about numbers?

Solution:

Result:

Attention is Not Only a Weight: Analyzing Transformers with Vector Norms

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.574/
Problem: analyze attention based on the norm of the transformed input vectors.
Solution:

Instead of using only the softmax weights, use the norm of the weighted, transformed vectors.
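A small sketch of the norm-based measure (random tensors with illustrative shapes): instead of reading the attention weight alpha alone, measure the norm of alpha times the transformed vector f(x):

```python
# Sketch: norm-based attention analysis. For each (query i, key j) pair, compute
# ||alpha_ij * f(x_j)|| rather than just alpha_ij, where f(x_j) is the transformed
# (value-projected) vector. Random tensors stand in for a real Transformer layer.
import torch

seq_len, d_model = 6, 64
alpha = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)  # attention weights
fx = torch.randn(seq_len, d_model)                            # transformed vectors f(x_j)

weighted = alpha.unsqueeze(-1) * fx.unsqueeze(0)   # (i, j, d): alpha_ij * f(x_j)
norm_based = weighted.norm(dim=-1)                 # (i, j): ||alpha_ij * f(x_j)||

print(alpha[0])       # weight-only view of what position 0 attends to
print(norm_based[0])  # the norm-based view can rank positions differently
```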

Result:

On the weak link between importance and prunability of attention heads

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.260/
Problem: how different head-pruning strategies change results on NLP tasks.
Solution:
Can we randomly prune attention heads?

Pruning layers on BERT

Result:

ETC: Encoding Long and Structured Inputs in Transformers

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.19/
Problem: reduce Transformer attention complexity to scale to longer inputs.
Solution:

CPC — a sentence-level analogue of MLM (Contrastive Predictive Coding)

Result:

Calibration of Pre-trained Transformers

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.21/
Problem: can we trust the model's predicted probabilities?

Solution:

Result:

Question Answering

What Does My QA Model Know? Devising Controlled Probes Using Expert Knowledge

Paper: https://arxiv.org/abs/1912.13337
Problem: testing the knowledge of multiple-choice QA models
Solution:
Build diagnostic tasks — ask questions about abstract knowledge.

Result:

Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.467/
Problem: generate both answerable and unanswerable questions for data augmentation.
Solution:
Train an MRC model that predicts the answer position and whether the question is answerable.
Compress the question into a latent embedding using an encoder on top of the MRC model.
Decode directly to obtain an answerable question.
For an unanswerable question, adjust the latent embedding before decoding (a sketch of this editing loop follows the list):

  • update the encoded q until the MRC model tends to predict "unanswerable";
  • prevent the latent representation from being edited too much;
  • prevent the result from becoming dissimilar to the original question.
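A hedged sketch of that latent-editing loop (the answerability scorer and similarity penalty are placeholders; only the shape of the update is shown):

```python
# Sketch: edit the latent question representation z by gradient descent so that a
# (placeholder) MRC answerability head predicts "unanswerable", while penalties keep
# z close to the original latent and the result similar to the original question.
import torch

def edit_latent(z_init, unanswerable_logit, similarity_penalty,
                steps=50, lr=0.1, lam_latent=1.0, lam_sim=1.0):
    z = z_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = (
            -unanswerable_logit(z)                    # push toward "unanswerable"
            + lam_latent * (z - z_init).pow(2).sum()  # do not move too far in latent space
            + lam_sim * similarity_penalty(z)         # stay similar to the original question
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()

# toy stand-ins for the MRC answerability head and the similarity term
w = torch.randn(128)
z0 = torch.randn(128)
edited = edit_latent(z0, lambda z: z @ w, lambda z: torch.tensor(0.0))
```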

Result:

Summarization

Q-learning with Language Model for Edit-based Unsupervised Summarization

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.34/
Problem: Unsupervised summarization
Solution:
Compress and decompress the context (like an autoencoder).

How are compressed sentences generated? By editing the source context!
Use a Q function to decide which action to take — delete, keep, or replace — then generate the summary based on the chosen actions (an MLM performs the replacements).

The agent is updated with three measures — step reward, violation penalty, and summary assessment (a sketch of how they might combine follows below).

The step reward encourages the model to reconstruct removed words.

Violation penalty — cr stands for compression rate, rr stands for reconstruction rate; both must stay above a lower bound at each time step.

The summary assessment takes three perspectives into account: informativeness, shortness, and fluency. Informativeness refers to how much y retains the original meaning of x; shortness and fluency are self-explanatory.
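A rough sketch of how those three terms could be combined into the agent's reward (the thresholds and weights are placeholders, not the paper's values):

```python
# Sketch: combine the step reward, the violation penalty (compression rate cr and
# reconstruction rate rr must stay above lower bounds), and the summary assessment
# (informativeness, shortness, fluency). All thresholds/weights are placeholders.

def violation_penalty(cr, rr, cr_min=0.3, rr_min=0.8, penalty=-1.0):
    return penalty if (cr < cr_min or rr < rr_min) else 0.0

def summary_assessment(informativeness, shortness, fluency):
    return (informativeness + shortness + fluency) / 3.0

def total_reward(step_reward, cr, rr, informativeness, shortness, fluency):
    return (step_reward
            + violation_penalty(cr, rr)
            + summary_assessment(informativeness, shortness, fluency))

print(total_reward(step_reward=0.2, cr=0.4, rr=0.9,
                   informativeness=0.7, shortness=0.8, fluency=0.9))
```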

Overall

Result:

Multi-Fact Correction in Abstractive Text Summarization

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.749/
Problem: system-generated abstractive summaries often face the pitfall of factual inconsistency: generating incorrect facts with respect to the source text.
Solution:
QA-Span Fact Correction Model:
one entity: iteratively mask each entity and replace it with a span from the source, using a span-based QA model.

Auto-regressive Fact Correction Model:
multiple entities: mask out all entities and iteratively predict each mask with beam search.
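A heavily simplified sketch of the iterative mask-and-fill idea using an off-the-shelf masked LM (the paper trains dedicated QA-span and autoregressive correctors; the entity list and filtering below are my own placeholders):

```python
# Simplified sketch: mask one suspect entity at a time and fill it with a masked-LM
# candidate that also appears in the source document.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def correct_entities(summary, entities, source):
    for ent in entities:
        masked = summary.replace(ent, fill_mask.tokenizer.mask_token, 1)
        # keep only candidates that actually appear in the source document
        candidates = [c["token_str"] for c in fill_mask(masked)
                      if c["token_str"].strip() in source.lower()]
        replacement = candidates[0] if candidates else ent  # fall back to the original
        summary = masked.replace(fill_mask.tokenizer.mask_token, replacement, 1)
    return summary

source = "Apple reported record revenue in 2021, CEO Tim Cook said."
summary = "google reported record revenue in 2021."
print(correct_entities(summary, ["google"], source))
```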

Result:
factual consistency measures (QGQA and FactCC)

Brainstorm:
Will that be slow 😂

Machine Translation and Multilinguality

Simulated multiple reference training improves low-resource machine translation

Paper: https://arxiv.org/abs/2004.14524
Problem: a translation can have many valid results; however, the lack of training data limits this possibility.
Solution:
Create multiple targets by treating it as a paraphrasing problem, changing the training target from predicting the reference token to matching a paraphraser's distribution.
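A hedged PyTorch sketch of swapping the one-hot cross-entropy target for a paraphraser's distribution (random tensors, not the paper's exact setup):

```python
# Sketch: instead of cross-entropy against a single reference token, train the
# translation model to match a paraphraser's distribution over the vocabulary at
# each target position. Tensors here are random placeholders.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 12
model_logits = torch.randn(seq_len, vocab_size, requires_grad=True)

# distribution produced by a paraphrase model over possible target tokens
paraphraser_probs = torch.softmax(torch.randn(seq_len, vocab_size), dim=-1)

loss = F.kl_div(F.log_softmax(model_logits, dim=-1),
                paraphraser_probs, reduction="batchmean")
loss.backward()
print(loss.item())
```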

Result:

Using the distribution in the training objective and sampling from the distribution both increase performance.

This method works better than back-translation.

It also works better than data augmentation.

An Empirical Study of Generation Order for Machine Translation

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.464.pdf
Problem: an investigation of text generation order.
Solution:
Propose a framework that handles generation in different orders — an insertion transformer (given the context, it returns one or more words along with their positions).
Result:
The Transformer model seems to have no limitation on its generation order.

Brainstorm:
Different position encoding strategies may indicate different results.

Inference Strategies for Machine Translation with Conditional Masking

Paper: https://arxiv.org/pdf/2010.02352.pdf
Problem: speed up generation while sustaining its performance.
The strategy used for unmasking tokens influences prediction speed.

Solution:

Result:
comb-thresh

Non-Autoregressive Machine Translation with Latent Alignments

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.83/
Problem: improve non-autoregressive generation.

  1. Independent outputs lead to token repetition.
  2. The output length must be predicted as a pre-processing step.

Solution:
Apply latent alignment models to address both issues.

Non-autoregressive method: CTC; semi-autoregressive method: Imputer.

** With Distillation **
Use data distilled from an autoregressive teacher to train the models; autoregressive Transformers generate the distilled data.

Result:

Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.211/
Problem:
Prune Transformer attention heads to get faster inference.
Solution:
Lottery ticket hypothesis: some parts of the network were luckily initialized and perform most of the work.

Result:
