EMNLP 2020 Notes
A look at new papers from EMNLP 2020, introducing each paper from a problem/solution/result perspective.
Machine Learning for NLP
Seq2Edits: Sequence Transduction Using Span-level Edit Operations
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.418
Problem: speed up seq2seq generation via editing.
Solution:
Input and target overlap heavily, so the target can be produced with span-level edits (replace/delete/copy) instead of generating every token from scratch.
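To make the span-edit idea concrete, here is a minimal sketch (plain Python difflib, not the paper's tagger) that recovers copy/replace/delete/insert spans between a source and a target sentence:

# Sketch: recover span-level edit operations between a source and a target
# token sequence using difflib. "equal" spans correspond to copies.
from difflib import SequenceMatcher

source = "She go to school every days".split()
target = "She goes to school every day".split()

matcher = SequenceMatcher(a=source, b=target)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    # op is one of "equal" (copy), "replace", "delete", "insert"
    print(op, source[i1:i2], "->", target[j1:j2])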
Result:
PatchBERT: Just-in-Time, Out-of-Vocabulary Patching
Paper:
Problem: Improve Out-of-Vocabulary performance
Solution:
Map an out-of-vocabulary word onto subwords that already exist in the vocabulary.
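As a rough illustration of OOV patching (a greedy longest-match segmentation into known subwords, not necessarily the paper's exact procedure):

# Sketch: greedily split an out-of-vocabulary token into subwords that
# already exist in the model vocabulary (WordPiece-style longest match first).
def patch_oov(token, vocab):
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        while end > start:
            piece = token[start:end] if start == 0 else "##" + token[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:          # no known subword covers this position
            return None
        start = end
    return pieces

vocab = {"block", "##chain", "##s", "chain"}
print(patch_oov("blockchains", vocab))   # ['block', '##chain', '##s']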
Result:
Pre-Training Transformers as Energy-Based Cloze Models
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.20
Problem: a more effective pre-training objective.
Solution:
How does this relate to ELECTRA? The paper trains an energy-based cloze model (Electric) with noise-contrastive estimation.
ELECTRA can be viewed as a variant of Electric using negative sampling instead of noise-contrastive estimation.
Result:
Sequence-Level Mixed Sample Data Augmentation
Paper:
Problem: capture systematic compositionality
Solution:
Result:
Autoregressive Knowledge Distillation through Imitation Learning
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.494
Problem: compressing autoregressive models.
At training time, the student's input comes from the teacher's output;
at inference time, the input comes from the student's own previous output;
this training-inference inconsistency causes a decrease in generation quality.
Solution:
- the student model must be trained on its own state distribution so that it will perform better at generation.
- the teacher model should play the role of the oracle and correct the student’s generations at each time step.
The training set is populated with new data generated from the oracle-learner mixture, and the policy learner is re-trained on the aggregated dataset at each iteration (a DAgger-style loop; a minimal sketch follows below).
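A minimal sketch of this DAgger-style loop, with toy lookup-table "models" standing in for the teacher and student (all names and data are made up for illustration):

# Sketch of DAgger-style distillation: roll out the student on its own
# prefixes, query the teacher (oracle) for the correct next token at each
# step, aggregate the pairs, and re-fit the student.
from collections import Counter, defaultdict

def teacher_policy(prefix, reference):
    # Oracle: always continue the reference sequence.
    return reference[len(prefix)] if len(prefix) < len(reference) else "<eos>"

class StudentPolicy:
    def __init__(self):
        self.table = defaultdict(Counter)
    def act(self, prefix):
        key = tuple(prefix[-1:])          # condition on the last token only
        if self.table[key]:
            return self.table[key].most_common(1)[0][0]
        return "<eos>"
    def fit(self, dataset):
        for prefix, action in dataset:
            self.table[tuple(prefix[-1:])][action] += 1

reference = ["the", "cat", "sat", "<eos>"]
student, dataset = StudentPolicy(), []
for iteration in range(3):
    prefix = []
    for _ in range(len(reference)):
        teacher_action = teacher_policy(prefix, reference)   # oracle correction for this state
        dataset.append((list(prefix), teacher_action))       # aggregate the labelled state
        next_token = student.act(prefix)                     # follow the student's own state distribution
        if next_token == "<eos>":
            break
        prefix.append(next_token)
    student.fit(dataset)                                     # re-train on the aggregated dataset
print(student.act([]), student.act(["the"]), student.act(["cat"]))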
Result:
Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.634
Problem: fine-tune with less forgetting
Solution:
Result:
On Losses for Modern Language Models
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.403
Problem: investigate pre-training tasks and losses to find a better way to pre-train.
Solution:
Result:
Semantic Label Smoothing for Sequence to Sequence Problems
Paper:
Problem: apply label smoothing to sequence-to-sequence tasks, smoothing over semantically similar sequences rather than uniformly.
Solution:
Result:
Embedding Words in Non-Vector Space with Unsupervised Graph Learning
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.594
Problem: graph-based word embeddings that capture hierarchical structure.
Solution:
The graph's vertices (words), edges, and edge weights are all learned.
Step by step:
Objective: the distance between two words is the (weighted) shortest-path distance between their nodes.
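A toy illustration of the distance computation on a hand-built weighted graph (the paper learns the graph; networkx is used here only for the shortest-path call):

# Toy illustration: word distance as weighted shortest-path length in a graph.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("animal", "dog", 1.0),
    ("animal", "cat", 1.0),
    ("dog", "puppy", 0.5),
    ("cat", "kitten", 0.5),
])
print(nx.shortest_path_length(G, "puppy", "kitten", weight="weight"))  # 3.0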
Result:
Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.493
Problem: find an optimal masking policy for further pre-training that yields better task-specific results.
Solution:
Neural Mask Generator — Find the masking scheme that can best boost pre-training performance using deep RL
Result:
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.36
Problem: knowledge distillation on intermediate representations usually does not consider the relations between layers.
Solution:
Average pooling over the intermediate representations.
Use a contrastive loss to pull the teacher's and student's representations together; representations of other input examples serve as negative samples.
A memory bank handles the huge number of negative samples.
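A sketch of what such a contrastive (InfoNCE-style) distillation loss with a memory bank could look like, on dummy pooled representations; dimensions and data are placeholders, not the paper's implementation:

# Contrastive distillation sketch: the student's representation should be
# closer to its matching teacher representation than to negatives stored
# in a memory bank of other inputs.
import torch
import torch.nn.functional as F

dim, bank_size, tau = 128, 4096, 0.07
memory_bank = F.normalize(torch.randn(bank_size, dim), dim=-1)  # negatives from other inputs

def contrastive_distill_loss(student_repr, teacher_repr):
    s = F.normalize(student_repr, dim=-1)        # (batch, dim), e.g. mean-pooled layer output
    t = F.normalize(teacher_repr, dim=-1)
    pos = (s * t).sum(-1, keepdim=True) / tau    # similarity to the matching teacher repr
    neg = s @ memory_bank.t() / tau              # similarity to bank negatives
    logits = torch.cat([pos, neg], dim=-1)
    labels = torch.zeros(s.size(0), dtype=torch.long)  # the positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = contrastive_distill_loss(torch.randn(8, dim), torch.randn(8, dim))
print(loss.item())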
Result:
Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.17
Problem: multi-head attention may suffer from attention collapse
Solution:
Sample each head's parameters from a distribution (a Bayesian view of multi-head attention).
Encourage different heads to match different parts of the target distribution, keeping the heads repulsive (diverse).
Result:
code: https://github.com/bangann/Repulsive-Attention
Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.38
Problem: Lack of data in fine-tuning
Solution:
Construct self-supervised meta-learning tasks for few-shot adaptation.
Result:
Lifelong Language Knowledge Distillation
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.233
Problem: catastrophic forgetting problem in lifelong learning
Solution:
Apply knowledge distillation to the language model when learning each new task in the lifelong stream.
Result:
BAE: BERT-based Adversarial Examples for Text Classification
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.498
Problem: Adversarial attack on classification
Solution:
Result:
Grounded Compositional Outputs for Adaptive Language Modeling
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.96
Problem: word-level language model with a size that does not depend on the training vocabulary
Solution:
Result:
Language Generation
Plan ahead: Self-Supervised Text Planning for Paragraph Completion Task
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.529
Problem: Paragraph level MLM does not focus on topical content.
Solution:
Keywords guide generation
Result:
Augmented Natural Language for Generative Sequence Labeling
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.27
Problem: improve low-resource slot labeling by joint sequence labeling and sentence-level classification.
Solution:
A seq2seq generative model whose output is augmented natural language that encodes the labels.
Result:
A* Beam Search
Paper: not provided
Problem: Speed up beam search
Solution:
Best-first beam search: keep partial hypotheses in a priority queue and expand only the best-scoring path at each step.
For example, pop the current best hypothesis (score -0.4), expand it, pop the new best (-0.6), then the next best (-1.5), and so on,
until the popped hypothesis ends with EOS (a minimal sketch follows below).
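A minimal sketch of the idea with a toy language model and Python's heapq (not the paper's implementation or scoring):

# Best-first search over hypotheses: always pop and expand the
# highest-scoring partial sequence until one ending in <eos> is popped.
import heapq
import itertools
import math

toy_lm = {  # next-token log-probs given the last token
    "<s>":  {"the": math.log(0.6), "a": math.log(0.4)},
    "the":  {"cat": math.log(0.7), "dog": math.log(0.3)},
    "a":    {"cat": math.log(0.5), "dog": math.log(0.5)},
    "cat":  {"<eos>": math.log(0.9), "sat": math.log(0.1)},
    "dog":  {"<eos>": math.log(0.8), "ran": math.log(0.2)},
    "sat":  {"<eos>": 0.0},
    "ran":  {"<eos>": 0.0},
}

def best_first_search(max_expansions=100):
    counter = itertools.count()              # tie-breaker so the heap never compares sequences
    heap = [(0.0, next(counter), ["<s>"])]
    for _ in range(max_expansions):
        neg_score, _, seq = heapq.heappop(heap)   # only the current best path is expanded
        if seq[-1] == "<eos>":
            return seq, -neg_score
        for token, logp in toy_lm[seq[-1]].items():
            heapq.heappush(heap, (neg_score - logp, next(counter), seq + [token]))
    return None

print(best_first_search())  # (['<s>', 'the', 'cat', '<eos>'], ...)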
Result:
COD3S: Diverse Generation with Discrete Semantic Signatures
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.421
Problem: Diverse sentence generation — same input, different semantic output
Solution:
Embed the target sentence's semantics with SBERT.
Compress the embedding into a short bit signature with locality-sensitive hashing (LSH); a small number of bits preserves the similarity information of the original embedding (approximates cosine similarity).
The decoding process is conditioned on both the LSH signature and the input tokens, so different signatures produce semantically different outputs.
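A sketch of the LSH step using random-hyperplane hashing (SimHash) on dummy vectors standing in for SBERT embeddings; similar embeddings get signatures that differ in few bits:

# SimHash sketch: random hyperplanes turn an embedding into a bit signature
# whose Hamming distance approximates cosine distance.
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 768, 16
hyperplanes = rng.standard_normal((n_bits, dim))

def lsh_signature(embedding):
    return (hyperplanes @ embedding > 0).astype(int)   # one bit per hyperplane

a = rng.standard_normal(dim)
b = a + 0.1 * rng.standard_normal(dim)      # nearly parallel to a -> similar signature
c = rng.standard_normal(dim)                # unrelated -> about half the bits differ
print((lsh_signature(a) != lsh_signature(b)).sum(), "bits differ (similar pair)")
print((lsh_signature(a) != lsh_signature(c)).sum(), "bits differ (random pair)")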
Result:
Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.58
Problem: left-to-right language model that can consider future input.
Solution: off-the-shelf, left-to-right language models and no supervision.
Forward pass: regular left-to-right decoding.
Backward pass: take the future input z into account by backpropagating its loss, which gives gradient-updated logits.
Mix the forward and backward logits and sample a new result.
Repeat.
Result:
Reformulating Unsupervised Style Transfer as Paraphrase Generation
Paper: http://style.cs.umass.edu
Problem: improve semantic preservation in style transfer.
Solution:
Reformulate style transfer as a controlled paraphrase generation task.
Diverse paraphraser: generate a paraphrase of each styled sentence with a pretrained language model (GPT);
this step normalizes away the style, since the language model has learned a generic way to write sentences.
Inverse paraphraser: convert the normalized paraphrase back into its original style (one model per style).
At inference time, swap in another style's inverse paraphraser to perform style transfer.
Result:
Unsupervised Text Style Transfer with Padded Masked Language Models
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.699
Problem: non-parallel (unsupervised) text style transfer.
Solution:
Result:
Dialog and Interactive Systems
Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems
Paper: https://virtual.2020.emnlp.org/paper_main.2281.html
Problem: reduce the cost of evaluating bots with humans.
Solution:
Rank a set of bots based on bot-bot conversations instead of bot-human conversations.
Human annotators only judge whether each side of a conversation is a bot or a human.
Longer conversations expose bots more reliably.
Result:
Dialogue Response Ranking Training with Large-Scale Human Feedback Data
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.28/
Problem: improve dialogue response ranking with large-scale human feedback data.
Solution:
Result:
Resources
More Bang for Your Buck: Natural Perturbation for Robust Question Answering
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.12/
{ "cluster-id" : "25938" , "question_id" : "267" , "is_seed_question" : 0 , "split" : "train" , "passage" : "(Thanksgiving (United States)) Thanksgiving, or Thanksgiving Day, is a public holiday celebrated on the fourth Thursday of November in the United States. It originated as a harvest festival. Thanksgiving has been celebrated nationally on and off since 1789, after Congress requested a proclamation by George Washington. It has been celebrated as a federal holiday every year since 1863, when, during the American Civil War, President Abraham Lincoln proclaimed a national day of ``Thanksgiving and Praise to our beneficent Father who dwelleth in the Heavens,'' to be celebrated on the last Thursday in November. Together with Christmas and the New Year, Thanksgiving is a part of the broader fall/winter holiday season in the U.S." , "question" : "is thanksgiving sometimes the last thursday of the month?" , "hard_label" : "True" , "soft_label" : 0.75 , "roberta_hard" : true , "ind_human_label" : "?" }
Dataset: https://github.com/allenai/natural-perturbations
SubjQA: A Dataset for Subjectivity and Review Comprehension
Paper: https://virtual.2020.emnlp.org/paper_main.595.html
Fields: domain, question, review, human_ans_spans, human_ans_indices, question_subj_level, ques_subj_score, is_ques_subjective, answer_subj_level, ans_subj_score, is_ans_subjective, nn_mod, nn_asp, query_mod, query_asp, item_id, review_id, q_review_id, q_reviews_id
Example row: electronics , How well does the speaker work for you? , "To those who think these speakers are too quiet, I suggest you check your ""Sound"" settings (in Mac). Go to ""Output"" and check whether ""Logitech"" is selected, or internal speakers. I'm sure Windows instructions are similar -- Control Panel, Sound, something.If that doesn't work, turn your hearing aid up.This thing is incredibly designed, they should win an award for it. I think it was designed around the 13"" Macbook Pro (that's what I have). It nails at least 7 of Dieter Rams' 10 Principles of design, most especially ""Good design is unobtrusive.""My only complaint is that the sound mix is a little mid-heavy for my taste. That's pretty nitpicking for $40 speakers that hang like a ninja behind my laptop monitor, but I thought I'd share that observation as well. ANSWERNOTFOUND" , , speakers are too quiet , "(25, 47)" , 1 , 0.0 , False , 1 , 0.3333333333333333 , False , quiet , speaker , bad , speaker , B003VAK1I2 , 7a6efb45bd32de268c3a7868d313da0a , d38214a266310836090f4e49bc9f6dbb , e9865435e082e8d4f2cad9ecefa685c4
Dataset: https://github.com/megagonlabs/SubjQA
Learning to Explain: Datasets and Models for Identifying Valid Reasoning Chains in Multihop Question-Answering
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.10/
Dataset: https://allenai.org/data/eqasc
STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.525
Dataset: https://storium.cs.umass.edu
GLUCOSE: GeneraLized and COntextualized Story Explanations
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.370
Dataset: https://tinyurl.com/yyeo92pt
NLP Applications
Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.581/
Problem: Improve the efficiency of Grammatical Error Correction (GEC)
Solution:
Result:
Information Extraction
Counterfactual Generator: A Weakly-Supervised Method for Named Entity Recognition
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.590/
Problem: improve few-shot NER via data augmentation.
Solution:
Result:
Semantics: Lexical Semantics
BERT Knows Punta Cana is not just beautiful, it’s gorgeous: Ranking Scalar Adjectives with Contextualised Representations
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.598/
Problem: ranking scalar adjectives by intensity (e.g., beautiful < gorgeous).
Solution:
Result:
Digital Voicing of Silent Speech
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.445/
Problem:
Convert silent-speech EMG signals into vocalized audio.
Solution:
Two tricks: alignment and canonical correlation analysis (CCA).
Alignment: align E_s (silent EMG) with E_v (vocalized EMG) so that the vocalized audio A_v can serve as a target for the silent recordings.
Canonical correlation analysis (CCA) is the second trick; a minimal sketch of the library call follows below.
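For reference, a minimal scikit-learn CCA call on dummy feature matrices standing in for EMG and audio features; this only illustrates the library usage, not the paper's actual alignment pipeline:

# CCA sketch on dummy per-frame features: project two feature spaces into a
# shared space and check the per-component correlations.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
emg_feats = rng.standard_normal((500, 32))                       # stand-in for EMG features
audio_feats = emg_feats @ rng.standard_normal((32, 20)) + 0.1 * rng.standard_normal((500, 20))

cca = CCA(n_components=8)
cca.fit(emg_feats, audio_feats)
emg_c, audio_c = cca.transform(emg_feats, audio_feats)           # projected into a shared space
corr = [np.corrcoef(emg_c[:, i], audio_c[:, i])[0, 1] for i in range(8)]
print(np.round(corr, 3))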
Result:
Interpretability and Analysis of Models for NLP
Pretrained Language Model Embryology: The Birth of ALBERT
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.553
Problem: investigate how an ALBERT model develops during pre-training, starting from scratch.
Solution:
Checkpoint ALBERT every N parameter-update steps during pre-training and study what it has learned and what it can achieve so far, tracking:
the development of its ability to predict and reconstruct tokens,
its world knowledge,
its downstream task performance.
Result:
What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.555
Problem: The role of position embedding in transformer model.
Solution:
Absolute position regression: the absolute position can be recovered from the learned embedding.
Relative position regression: order information can also be captured from the embedding.
For BERT and RoBERTa the relationship between position embeddings and positions is less clear (a minimal probing sketch follows below).
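A sketch of the probing idea: fit a linear regressor that predicts a position index from the frozen position-embedding vector. It assumes the Hugging Face transformers and scikit-learn packages and downloads bert-base-uncased:

# Probe sketch: can a linear model recover the absolute position from
# BERT's learned position embeddings?
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
pos_emb = model.embeddings.position_embeddings.weight.detach().numpy()  # (512, 768)
positions = np.arange(pos_emb.shape[0])

X_tr, X_te, y_tr, y_te = train_test_split(pos_emb, positions, test_size=0.2, random_state=0)
probe = Ridge().fit(X_tr, y_tr)
print("absolute-position regression R^2:", probe.score(X_te, y_te))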
Result:
Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.557/
Problem: does BERT have commonsense knowledge about numbers (e.g., how many legs a bird has)?
Solution:
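The probing setup boils down to fill-mask queries; a quick illustration with the Hugging Face fill-mask pipeline (one made-up query, not the benchmark data):

# Fill-mask probe in the spirit of NumerSense: does the model rank a
# sensible number highly for the masked slot?
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("A bird usually has [MASK] legs.")[:5]:
    print(round(pred["score"], 3), pred["token_str"])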
Result:
Attention is Not Only a Weight: Analyzing Transformers with Vector Norms
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.574/
Problem: analyze attention based on the norm of the transformed input vectors rather than the attention weights alone.
Solution:
Instead of looking only at the softmax weights, analyze the norm of each weighted, transformed vector (weight times transformed input).
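In symbols (my paraphrase, biases omitted): the attention output is a weighted sum of transformed vectors, and the analysis measures the norm of each weighted term rather than the weight alone:

y_i = \sum_j \alpha_{i,j} f(x_j), \qquad
f(x_j) = x_j W^{V} W^{O}, \qquad
\text{contribution of } x_j \text{ to } y_i = \lVert \alpha_{i,j} f(x_j) \rVert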
Result:
On the weak link between importance and prunability of attention heads
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.260/
Problem: how different head-pruning strategies change results on NLP tasks.
Solution:
Can we randomly prune attention heads instead of selecting them by importance?
Pruning experiments on BERT layers and heads.
Result:
ETC: Encoding Long and Structured Inputs in Transformers
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.19/
Problem: reduce transformer attention complexity to scale input length.
Solution:
Global-local sparse attention to scale to long and structured inputs, plus a CPC (Contrastive Predictive Coding) objective that acts like a sentence-level MLM.
Result:
Calibration of Pre-trained Transformers
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.21/
Problem: can we trust a model's predicted probabilities, i.e., are pre-trained Transformers well calibrated?
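Calibration in this line of work is usually measured with expected calibration error (ECE); a minimal numpy sketch on dummy predictions:

# ECE sketch: bin predictions by confidence and compare average confidence
# to accuracy within each bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap          # weight by the fraction of samples in the bin
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = (rng.uniform(size=1000) < conf - 0.1).astype(float)   # a slightly overconfident model
print(round(expected_calibration_error(conf, correct), 4))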
Solution:
Result:
Question Answering
What Does My QA Model Know? Devising Controlled Probes Using Expert Knowledge
Paper: https://arxiv.org/abs/1912.13337
Problem: testing multiple-choice QA ability
Solution:
Build diagnostic probing tasks that ask questions about abstract knowledge derived from expert resources.
Result:
Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.467/
Problem: generate both answerable and unanswerable questions for data augmentation.
Solution:
Train an MRC model that predicts the answer span and whether a question is answerable.
Compress the question into a latent embedding using an encoder on top of the MRC model.
For answerable questions, decode the embedding directly.
For unanswerable questions, edit the latent embedding before decoding (see the sketch after this list):
- update the question embedding with gradient steps until the MRC model tends to predict "unanswerable";
- prevent the latent representation from being edited too much;
- prevent the result from drifting too far from the original question.
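A sketch of the gradient-based latent editing step with a dummy answerability classifier (the classifier, dimensions, and weights are placeholders, not the paper's trained MRC model):

# Nudge a latent vector by gradient descent until the classifier leans
# toward "unanswerable", while an L2 term keeps the edit small.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
classifier = torch.nn.Linear(64, 2)          # logits for [answerable, unanswerable]
latent = torch.randn(1, 64)

edited = latent.clone().requires_grad_(True)
optimizer = torch.optim.Adam([edited], lr=0.05)
target = torch.tensor([1])                   # push toward "unanswerable"
for step in range(50):
    optimizer.zero_grad()
    loss = F.cross_entropy(classifier(edited), target) \
         + 0.1 * F.mse_loss(edited, latent)  # stay close to the original embedding
    loss.backward()
    optimizer.step()
print(classifier(edited).softmax(-1).detach())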
Result:
Summarization
Q-learning with Language Model for Edit-based Unsupervised Summarization
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.34/
Problem: Unsupervised summarization
Solution:
Compress and then reconstruct the context, like an autoencoder.
How are the compressed sentences generated? By editing the source context.
A Q-function decides which action to take for each token (delete, keep, or replace), and the summary is produced from those actions, with an MLM filling in the replacements; a minimal sketch of this action-selection step follows below.
The agent is updated with three signals: a step reward, a violation penalty, and a summary assessment.
The step reward encourages the model to reconstruct the removed words.
The violation penalty (cr stands for compression rate, rr for reconstruction rate) keeps both rates above a lower bound at each time step.
The summary assessment takes three perspectives into account: informativeness (how much y retains the original meaning of x), shortness, and fluency.
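A minimal sketch of the edit-action selection and a one-step Q-learning update, with dummy states and rewards rather than the paper's features and reward terms:

# A tiny Q-network scores the actions {keep, delete, replace} for a token
# state, picks one epsilon-greedily, and is updated toward a one-step
# Q-learning target.
import torch
import torch.nn.functional as F

ACTIONS = ["keep", "delete", "replace"]
q_net = torch.nn.Linear(32, len(ACTIONS))          # token-state features -> Q-values
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, epsilon = 0.95, 0.1

state = torch.randn(1, 32)                         # dummy features of the current token
q_values = q_net(state)
if torch.rand(1).item() < epsilon:                 # epsilon-greedy exploration
    action = torch.randint(len(ACTIONS), (1,)).item()
else:
    action = q_values.argmax(-1).item()

reward = torch.tensor(0.3)                         # e.g. step reward minus violation penalty
next_state = torch.randn(1, 32)
with torch.no_grad():
    target = reward + gamma * q_net(next_state).max()
loss = F.mse_loss(q_values[0, action], target)
optimizer.zero_grad(); loss.backward(); optimizer.step()
print(ACTIONS[action], loss.item())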
Result:
Multi-Fact Correction in Abstractive Text Summarization
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.749/
Problem: system-generated abstractive summaries often face the pitfall of factual inconsistency: generating incorrect facts with respect to the source text.
Solution:
QA-Span Fact Correction Model:
single entity: iteratively mask one entity at a time and replace it with a span selected from the source by a span-based QA model.
Auto-regressive Fact Correction Model:
multiple entities: mask out all entities and predict each mask autoregressively with beam search.
Result:
factual consistency measures (QGQA and FactCC)
Brainstorm:
Will that be slow 😂
Machine Translation and Multilinguality
Simulated multiple reference training improves low-resource machine translation
Paper: https://arxiv.org/abs/2004.14524
Problem: a source sentence should have many valid translations; however, the lack of multi-reference training data limits this possibility.
Solution:
Create multiple targets by treating it as a paraphrasing problem: change the objective from predicting the single target token to predicting the paraphraser's distribution over target tokens.
Result:
Using the paraphraser distribution in the training objective and sampling from that distribution both increase performance.
This method works better than back-translation,
and also better than plain data augmentation.
An Empirical Study of Generation Order for Machine Translation
paper: https://www.aclweb.org/anthology/2020.emnlp-main.464.pdf
problem: an investigation of text generation order
solution:
Propose a single framework that supports different generation orders: an insertion Transformer (given the context, it returns one or more words together with their positions).
Result:
The Transformer model seems to have no strong limitation regarding its generation order.
Brainstorm:
Different position encoding strategies may indicate different results.
Inference Strategies for Machine Translation with Conditional Masking
paper:https://arxiv.org/pdf/2010.02352.pdf
problem: speed up generation while sustaining its performance.
The strategy for choosing which tokens to unmask will influence prediction speed.
solution:
result:
comb-thresh
Non-Autoregressive Machine Translation with Latent Alignments
paper: https://www.aclweb.org/anthology/2020.emnlp-main.83/
problem: improve non-autoregressive generation
- conditionally independent outputs lead to token repetitions.
- output length must be predicted as a preprocessing step.
solution:
Apply latent alignment models:
CTC for the fully non-autoregressive setting and the Imputer for the semi-autoregressive setting (a CTC sketch follows below).
With distillation:
train on data distilled from an autoregressive Transformer teacher.
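A sketch of the CTC piece with torch.nn.CTCLoss on dummy tensors: the model emits one frame per (upsampled) source position and CTC marginalises over the latent alignments to the shorter target, so no explicit length prediction is needed:

# CTC loss sketch for non-autoregressive generation with latent alignments.
import torch

vocab_size, blank = 100, 0
src_len, tgt_len, batch = 20, 8, 4                 # decoder emits src_len frames per sentence

log_probs = torch.randn(src_len, batch, vocab_size).log_softmax(-1)   # (T, N, C)
targets = torch.randint(1, vocab_size, (batch, tgt_len))              # no blanks in targets
input_lengths = torch.full((batch,), src_len, dtype=torch.long)
target_lengths = torch.full((batch,), tgt_len, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=blank, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())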
result:
Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation
paper: https://www.aclweb.org/anthology/2020.emnlp-main.211/
problem:
prune transformer heads to get faster inference time.
solution:
lottery ticket hypothesis: some parts of the network were luckily initialized and perform most of the work.
result: