## 1 Introduction

> *[Textual] entailment inference is uncertain and has a probabilistic nature.*
>
> — Glickman et al. (2005)

Variants of entailment tasks have been used for decades in benchmarking systems for natural language understanding. Recognizing Textual Entailment (RTE) or Natural Language Inference (NLI) is a categorical classification problem: predict which of a set of discrete labels applies to an inference pair, consisting of a premise ($p$) and a hypothesis ($h$). The FraCaS consortium offered the task as an evaluation mechanism, along with a small challenge set (Cooper et al., 1996), which was followed by the RTE challenges (Dagan et al., 2006). Both employed a binary set of labels, which we here call *entailment* (ent) and *contradiction* (con). Subsequent challenges (Giampiccolo et al., 2007, 2008) shifted to a ternary label set, adding the category *neutral* (neu), as adopted by recent NLI datasets such as SICK (Marelli et al., 2014), SNLI (Bowman et al., 2015), and MultiNLI (Williams et al., 2018).

**Figure 1:** *Neutral* premise-hypothesis pairs taken from SNLI, relabeled with subjective probability; $p \xrightarrow{y} h$ denotes that the pair $(p, h)$ is labeled with subjective probability $y$. The pairs (elicited probability values omitted):

- $p$: Woman reaching for food at the supermarket. $h$: Woman is reaching for frozen corn at the store.
- $p$: The brown dog is laying down on a blue sheet. $h$: A dog is laying down on its side, sleeping.
- $p$: A tattooed woman puts on a motorcycle helmet. $h$: A woman is about to ride her motorcycle.

While researchers have recognized the inherently probabilistic nature of NLI, this recognition has primarily been restricted to models of inference, rather than the task itself (see §2). Here we propose the task of *Uncertain Natural Language Inference* (UNLI), which shifts NLI away from categorical labels to the direct prediction of human subjective probability assessments (see Figure 1 for examples). We illustrate that human-elicited probability assessments contain subtle distinctions in the likelihood of a hypothesis sentence conditioned on the context given by a premise sentence, far beyond a traditional ternary label (ent / neu / con) assignment. Further, we define UNLI models built upon BERT (Devlin et al., 2019) that exploit recent advances in large-scale language model pre-training, and provide experimental results illustrating that systems can often predict these judgments, but with clear gaps in understanding, including cases of logical incoherence.

UNLI is therefore a refinement of NLI that captures more subtle distinctions in meaning, that we can build models to target, and for which we can collect supporting data. We conclude that scalar annotation protocols such as the one employed here should be adopted in future NLI-style dataset creation, enabling new work in modeling a richer space of interesting inferences.

## 2 Background

Uncertainty in NLI has been considered from a variety of perspectives. Glickman et al. (2005) stated[^1] that *$p$ probabilistically entails $h$ … if $p$ increases the likelihood of $h$ being true*.[^2] Judges annotated a pair positively if they could infer the hypothesis from the premise with high confidence, and negatively otherwise: the prediction task was categorical, with associated model scores meant to reflect probabilities. Models were not provided at training time with annotations capturing subjective uncertainty.

[^1]: We rewrite prior work descriptions to consolidate on $p$ and $h$ for coherence; RTE's "text" $T$ becomes $p$.
[^2]: Glickman et al. (2005): $P(h \text{ is true} \mid p) > P(h \text{ is true})$.

Pavlick and Callison-Burch (2016) elicited ordinal annotations reflecting likelihood judgments, then averaged these labels under an assumption of uniform scalar distance between ordinal categories.[^3] The results were used for a manual analysis of the semantics of adjective-noun composition, but downstream use of the data in a model was restricted to casting the annotations to a ternary NLI classification problem.

[^3]: Annotators were asked to assume the premise *"is true, or describes a real scenario"* and then, using their best judgment, to indicate how likely it is, on a scale of 1 to 5, that the hypothesis *"is also true, or describes the same scenario."*

Lee et al. (2015) and others averaged ordinal judgments from multiple annotators on whether particular events mentioned in a sentence did or did not happen, with the resulting structured prediction task modeled as scalar regression (Stanovsky et al., 2017; Rudinger et al., 2018). Factuality data has been *recast* (White et al., 2017) into NLI form (Poliak et al., 2018), but retains the traditional NLI categories.
Reisinger et al. (2015) and White et al. (2016) similarly asked annotators to judge semantic properties on an ordinal scale, with the resulting data later recast to traditional NLI.

Lai and Hockenmaier (2017) leveraged a collection of image captions with a hierarchical structure to construct a probabilistic entailment model, stating that *learning to predict the conditional probability of one phrase given another phrase would be helpful in predicting textual entailment*. We instead ask humans directly for probability assessments on complete NLI pairs.

Lalor et al. (2016, 2018) attempt to capture the uncertainty of each inference pair via Item Response Theory (IRT), which parameterizes the discriminative power of each inference pair: how easy is it to predict the gold label? For example, a pair $(p, h)$ with con as its gold label has high discriminative power when reliable human annotators label the pair correctly (as con), and vice versa. Lalor et al. (2018) use IRT to estimate the discriminative power of each inference pair in a subset (180 pairs) of SNLI, showing fine-grained differences in discriminative power within each label. The IRT model relies on discrete labels as an oracle to determine the difficulty (discriminative power) of labeling each inference pair, whereas we propose direct elicitation of subjective probability.

COPA (Roemmele et al., 2011) and ROCStories (Mostafazadeh et al., 2016) are examples of multiple-choice tasks that capture *relative* uncertainty between examples, but do not require a model to predict the probability of $h$ given $p$.

Zhang et al. (2017) made use of a protocol similar to that of Pavlick and Callison-Burch (2016), using it first to analyze various existing datasets such as SNLI, COPA, and ROCStories, as well as their own automatically generated NLI hypotheses. For prediction, they advocated an extended definition of NLI with an increased number of categories.[^4] UNLI can be viewed as a scalar version of their proposal.

[^4]: A Likert 5-point scale: VeryLikely, Likely, Plausible, TechnicallyPossible, and Impossible.

Li et al. (2019) viewed the plausibility task of COPA as a *learning to rank* problem, training a model to assign the highest scalar score to the most plausible alternative given a context. Our work can be viewed as an extension of this, with the score being an explicit human probability judgment.

Linguists such as van Eijck and Lappin (2012), Goodman and Lassiter (2015), Cooper et al. (2015), and Bernardy et al. (2018) have described models for natural language semantics that introduce probabilities into the compositional, model-theoretic tradition begun by Davidson (1967) and Montague (1973). Where they propose probabilistic models for interpreting language, we are concerned with demonstrating the feasibility of eliciting probabilistic judgments on examples through crowdsourcing, in contrast with prior efforts restricted to limited categorical label sets.

Many works in AI (e.g., Garrette et al., 2011, among others) have proposed general language understanding systems with formal underpinnings based wholly or in part on probabilities. Here our focus is specifically on (U)NLI, as a motivating task for which we can gather data.

## 3 Uncertain NLI

We define UNLI by editing the definition given by Dagan et al. (2006) for their original shared task, RTE-1:

> We say that ~~$p$ entails $h$~~ *$h$ has subjective probability $y$ given $p$* if, typically, a human reading $p$ would infer that ~~$h$ is most probably true~~ *$h$ has a $y$ chance of being true*. This somewhat informal definition is based on (and assumes) common human understanding of language as well as common background knowledge.

Formally, given a premise $p$ and a hypothesis $h$, a UNLI model should output an uncertainty score $f(p, h) \in [0, 1]$ for the premise-hypothesis pair that correlates well with a human-provided subjective probability assessment. This is in contrast with a traditional 3-class NLI classification model $g : (p, h) \mapsto \mathcal{Y}$, where $\mathcal{Y} = \{\textit{ent}, \textit{neu}, \textit{con}\}$ is the 3-class label set.

#### Metrics

Given a UNLI dataset comprising premise-hypothesis-uncertainty triples $\{(p_i, h_i, y_i)\}$, predictions of uncertainty can be computed as $\hat{y}_i = f(p_i, h_i)$. We compute the Pearson correlation ($r$), the Spearman rank correlation ($\rho$), and the mean squared error (MSE) between $\hat{y}$ and $y$ as the metrics measuring the performance of UNLI models.

These metrics capture both the *ranking* and the *regression* aspects of the model: Pearson $r$ measures the linear correlation between the gold probability assessments and the model's output;
Spearman $\rho$ measures the model's ability to *rank* the premise-hypothesis pairs with respect to their subjective probability; and MSE measures whether the model can recover the subjective probability value itself from premise-hypothesis pairs. Note that we desire high $r$ and $\rho$ and low MSE.
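These three metrics are available in standard libraries; the following is a minimal sketch using NumPy and SciPy (the function name `unli_metrics` is ours, not from the paper):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def unli_metrics(y_gold, y_pred):
    """Compute Pearson r, Spearman rho, and MSE between gold subjective
    probabilities and model predictions."""
    y_gold = np.asarray(y_gold, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r, _ = pearsonr(y_gold, y_pred)      # linear correlation (regression aspect)
    rho, _ = spearmanr(y_gold, y_pred)   # rank correlation (ranking aspect)
    mse = float(np.mean((y_gold - y_pred) ** 2))
    return r, rho, mse

# Toy illustration with made-up values:
r, rho, mse = unli_metrics([0.9, 0.1, 0.5, 0.7], [0.8, 0.2, 0.4, 0.9])
```

A perfect predictor attains $r = \rho = 1$ and MSE $= 0$, which is why we report all three: a model can rank well ($\rho$ high) while still being poorly calibrated (MSE high).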

## 4 Data

We construct a UNLI dataset by eliciting subjective probabilities from crowd workers on Amazon Mechanical Turk for presented premise-hypothesis pairs.
No new NLI premise-hypothesis pairs are elicited or generated, as our focus is on the *uncertainty* aspect of NLI. Owing to its familiarity within the community, we choose to illustrate UNLI by re-annotating a sampled subset of
SNLI (Bowman et al., 2015). For examples taken across the three categories con / neu / ent we elicit a probability annotation $y \in [0, 1]$, resulting in what we call *U-SNLI* (Uncertain SNLI; see Table 1 for examples). We preferred SNLI over MultiNLI for this work because SNLI contains a subset of examples for which multiple neu hypotheses were collected per premise. Zhang et al. (2017) reported a wide range of ordinal likelihood judgments across SNLI neu examples, so we anticipated these multi-neutral premise examples would be good fodder for illustrating our points here.

There are 7,931 distinct premises in the SNLI training set that are paired with 5 or more distinct neu hypotheses; we take 5 such hypotheses per premise as elicitation prompts, yielding 39,655 neu pairs, plus an additional 15,862 con and ent pairs combined. Altogether this forms our training set of 55,517 pairs over 7,931 distinct premises. Dev and test sets were sampled from SNLI dev and test respectively, again with heavy emphasis on neu examples (see Table 3).

## 5 Annotation

Our process was inspired by the Efficient Annotation of Scalar Labels (EASL) framework of Sakaguchi and Van Durme (2018), which combines notions of direct and relative assessment in a single crowdsourcing interface. Items are grouped into lists of size $k$ and presented to a user in a single page view, each item paired with a slider bar. The slider bar enables direct assessment of each item; the interface also has an implicit relative-assessment aspect, in that judging multiple items placed visually together in a single page view encourages cross-item calibration of judgments. Our individual items were premise-hypothesis pairs, with instructions requesting a probability assessment (see Figure 2).

Annotators were asked to estimate how likely it is that the situation described in the hypothesis sentence is true given the premise. Example pairs were provided in the instructions, along with suggested probability values (see Figure 3 for three such examples). Annotators were encouraged to calibrate their score for a given item against the scores they gave other items in the same page view.

#### Interface

For each premise-hypothesis pair, we elicit a probability assessment in $[0, 1]$ from annotators using the interface shown in Figure 2, in contrast to the uniform scale employed in the original EASL protocol. We modified the interface to allow finer-grained values near 0.0 and 1.0, following findings that humans are especially sensitive to values near the ends of the probability spectrum (Tversky and Kahneman, 1981).[^5] Annotators were shown a numeric value computed as a non-linear projection of the slider position $x$, parameterized by a constant $\beta$.

[^5]: This is called the *certainty effect*: more sensitivity to the difference between, e.g., 0% and 1% than between 50% and 51%.

We ran pilots to tune $\beta$, finding that people often chose far lower probabilities for some events than seemed intuitive upon inspection (e.g., just below 50%). We therefore employed different $\beta$ values depending on the slider range (Figure 4).
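The exact projection used in our interface is not reproduced here; as an illustration only, the sketch below shows one plausible shape for such a mapping (a rescaled logistic — an assumption of ours, not the paper's formula). It is flat near the endpoints, so many slider positions map to probabilities close to 0.0 or 1.0, giving annotators finer control there, with $\beta$ controlling how strongly resolution concentrates at the ends:

```python
import math

def slider_to_prob(x: float, beta: float = 8.0) -> float:
    """Hypothetical non-linear projection of a slider position x in [0, 1]
    to a probability in [0, 1]. A rescaled logistic: steep in the middle,
    flat near 0 and 1, so the ends of the scale get finer granularity
    (the certainty effect). beta tunes the strength of the effect."""
    def sig(t):
        return 1.0 / (1.0 + math.exp(-t))
    lo, hi = sig(-beta / 2), sig(beta / 2)
    return (sig(beta * (x - 0.5)) - lo) / (hi - lo)
```

With this shape, moving the slider from 0.0 to 0.1 changes the displayed probability far less than moving it from 0.4 to 0.5, matching the desire for fine-grained values near the probability extremes.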

#### Qualification Test

Annotators were given a qualification test to ensure that non-expert workers could give reasonable subjective probability estimates. We first extracted seven statements from the *Book of Odds* (Shapiro et al., 2014), and manually split each statement into a bleached premise and hypothesis. We then wrote three easy premise-hypothesis pairs with definite probabilities, e.g., ($p$ = "A girl tossed a coin.", $h$ = "The coin comes up a head.", probability: 0.5). We qualified workers meeting both criteria: (1) for the three easy pairs, their annotations fell within a small error range around the correct label; and (2) their overall annotations correlated with the reference values above thresholds on both Pearson $r$ and Spearman $\rho$. This qualification test led to a pool of 40 trusted annotators, who were employed for the entirety of our dataset creation.

#### Incremental Annotation

Each item was doubly annotated. Where the first two annotations differed by more than 2,000 on the raw slider scale, we elicited a third annotation. The probability associated with a pair was then the median of the gathered responses.
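This incremental scheme can be sketched in a few lines (a minimal illustration; the function names are ours, and the raw slider values are whatever the interface records):

```python
import statistics

DISAGREEMENT = 2000  # threshold on the raw slider scale, per the protocol

def needs_third_annotation(raw_a: int, raw_b: int) -> bool:
    """A third judgment is elicited only when the first two raw slider
    positions differ by more than the disagreement threshold."""
    return abs(raw_a - raw_b) > DISAGREEMENT

def aggregate(probs):
    """Final probability for a pair: the median of the gathered responses.
    With two close annotations this is their midpoint; with three, the
    middle value, which is robust to a single outlier."""
    return statistics.median(probs)
```

Using the median rather than the mean means a single outlying third judgment cannot drag the final label far from the two concordant ones.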

#### Statistics

Statistics for the resulting train, dev, and test splits are reported in Table 3.

## 6 Model

We base our model for UNLI on the sentence-pair classifier[^6] of BERT (Devlin et al., 2019), to exploit recent advances brought by large-scale language model pre-training. The original NLI model concatenates the premise and the hypothesis, with a special sentinel token ([CLS]) inserted at the beginning and a separator ([SEP]) inserted after each sentence, tokenized using WordPiece. After passing this concatenated token sequence through the BERT encoder, it takes the encoding of the first sentinel token ([CLS]; index 0),

$$\mathbf{u} = \mathrm{BERT}(p, h)_{[\textsc{CLS}]}, \tag{1}$$

and passes the resulting feature vector $\mathbf{u}$ through a linear layer to produce one label from the traditional NLI label set. We modify this architecture to accommodate our scenario: the last layer of the network is changed from a 3-dimensional output to a *scalar* output, the logit score, and the sigmoid function $\sigma$ is applied so that the output lies in $[0, 1]$, as any probability should. The UNLI task is therefore modeled directly as a *regression* problem, trained using a binary cross-entropy loss[^7] between the human annotation $y$ and the model output $\hat{y}$.

[^6]: The neural architecture used for MultiNLI (Williams et al., 2018) in Devlin et al. (2019).
[^7]: An MSE loss yields no significant difference.
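The scalar head and its loss can be sketched in plain NumPy (a stand-in illustration, not the actual BERT implementation: the dummy vector `v` plays the role of the [CLS] encoding $\mathbf{u}$, and `w`, `b` are the new scalar layer's parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unli_head(v, w, b):
    """Scalar regression head: map a [CLS] feature vector v (stand-in for
    the BERT encoding) to a probability via a linear layer + sigmoid."""
    return sigmoid(np.dot(w, v) + b)

def bce_loss(y_hat, y):
    """Binary cross-entropy between annotation y and model output y_hat."""
    eps = 1e-12
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

rng = np.random.default_rng(0)
v = rng.normal(size=768)   # dummy 768-d feature, matching bert-base width
w, b = np.zeros(768), 0.0  # untrained head: outputs sigmoid(0) = 0.5
y_hat = unli_head(v, w, b)
```

Note that although the loss is binary cross-entropy, the targets $y$ are continuous values in $[0, 1]$ rather than 0/1 labels, which is what makes this a regression rather than a classification objective.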

### 6.1 Training with SNLI

We establish baselines for the UNLI task by training on only the SNLI dataset and its original 3-way classification labels (i.e., without our annotated uncertainty scores in U-SNLI). For illustrative purposes, we denote the original SNLI dataset as a set of premise-hypothesis-label triples $\{(p_i, h_i, l_i)\}$, where $l_i \in \{\textit{ent}, \textit{neu}, \textit{con}\}$.

#### Training via regression

We derive a surrogate function $s$ that maps each SNLI label $l$ to a surrogate score $s(l)$, computed as the average of all probabilistic annotations bearing label $l$ in the U-SNLI training set. The SNLI dataset is thereby mapped to a UNLI dataset $\{(p_i, h_i, s(l_i))\}$, and we use this mapped version to train our regression model.
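The surrogate mapping is a per-label mean; a minimal sketch (the function name and the toy annotation values are ours, not the paper's):

```python
from collections import defaultdict

def surrogate_scores(usnli_train):
    """Map each SNLI label to the mean of the U-SNLI probability
    annotations bearing that label. Input: (premise, hypothesis,
    label, probability) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _p, _h, label, y in usnli_train:
        sums[label] += y
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

# Toy illustration (annotation values made up, not the paper's):
toy = [("p1", "h1", "ent", 0.95), ("p1", "h2", "neu", 0.40),
       ("p1", "h3", "con", 0.05), ("p2", "h4", "ent", 0.85)]
s = surrogate_scores(toy)
```

Each SNLI triple $(p_i, h_i, l_i)$ is then replaced by $(p_i, h_i, s(l_i))$, collapsing every example of a given label onto a single scalar target.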

#### Training via learning to rank

Since we focus on the *uncertainty* aspect of NLI, we alternatively approach the problem as *learning to rank*: instead of regression, we train a model to correctly *rank* the premise-hypothesis pairs according to their probability, i.e., $f(p_i, h_i) > f(p_j, h_j)$ whenever $y_i > y_j$. To this end, we train the UNLI model with a margin-based loss (Weston and Watkins, 1999):

$$\mathcal{L} = \sum_{(i,j)\,:\,y_i > y_j} \max\bigl(0,\; \xi - f(p_i, h_i) + f(p_j, h_j)\bigr), \tag{2}$$

where $\xi$ is the margin hyperparameter. That is, the model learns to assign a higher score to $(p_i, h_i)$ than to $(p_j, h_j)$ whenever $y_i > y_j$, ideally with a gap larger than $\xi$. However, the summation in Equation 2 ranges over $O(n^2)$ pairs, where $n$ is the number of samples in the dataset, which is computationally infeasible. Hence we take the summation only over two subsets: (1) shared-premise pairs: pairs of samples with identical premises, which rank the probability of different hypotheses given the same premise; and (2) cross-premise pairs: for each sample $i$, we randomly sample $m$ other samples with different premises and lower probability.[^9] The union of these subsets is used as the training set, reducing the number of pairs considered to be linear in $n$.

[^9]: We skip this for con samples in SNLI, since there are no samples with lower probability.
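The loss of Equation 2 and the two pair subsets can be sketched as follows (a minimal illustration with hypothetical helper names; the margin value is arbitrary, and samples are (premise, hypothesis, probability) tuples):

```python
import random
import numpy as np

def margin_ranking_loss(scores_hi, scores_lo, margin=0.1):
    """Hinge loss of Eq. 2: push the model score of the higher-probability
    member of each pair above the lower one by at least `margin`."""
    hi = np.asarray(scores_hi, dtype=float)
    lo = np.asarray(scores_lo, dtype=float)
    return float(np.mean(np.maximum(0.0, margin - (hi - lo))))

def training_pairs(samples, m, rng):
    """Build the two subsets used in place of all O(n^2) ordered pairs:
    (1) shared-premise pairs, and (2) m random cross-premise pairs per
    sample with strictly lower gold probability."""
    pairs = []
    for i, (p_i, _h_i, y_i) in enumerate(samples):
        # (1) shared-premise pairs: same premise, i outranks j.
        for j, (p_j, _h_j, y_j) in enumerate(samples):
            if i != j and p_i == p_j and y_i > y_j:
                pairs.append((i, j))
        # (2) cross-premise pairs: sample up to m lower-probability others.
        lower = [j for j, (p_j, _h_j, y_j) in enumerate(samples)
                 if p_j != p_i and y_j < y_i]
        for j in rng.sample(lower, min(m, len(lower))):
            pairs.append((i, j))
    return pairs

samples = [("p1", "a", 0.9), ("p1", "b", 0.2), ("p2", "c", 0.5)]
pairs = training_pairs(samples, m=1, rng=random.Random(0))
```

The shared-premise subset directly teaches the relative plausibility of hypotheses under one context, while the cross-premise subset keeps scores comparable across contexts.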

### 6.2 Training with U-SNLI

We employ two ways to train a UNLI model with our U-SNLI dataset: (1) direct training: regression training on U-SNLI alone; (2) fine-tuning: fine-tuning either of the two SNLI-pre-trained models of §6.1 (namely, regression and ranking) on U-SNLI.[^10]

[^10]: This is similar to Pavlick and Callison-Burch (2016), who first pre-train on SNLI and then fine-tune the model using their *Add-One* pairs. Our scenario differs slightly in that our regression task is not exactly NLI; hence we first map the gold SNLI labels to scalar judgments, as described above.

## 7 Experiments and Discussion

#### Setup

We use the bert-base-uncased model with the Adam optimizer (Kingma and Ba, 2014) and maximum gradient norm 1.0 in all settings. For the SNLI ranking setting, the margin $\xi$ and the number of contrasting samples $m$ are tuned as hyperparameters. All models are trained for 3 epochs, and the epoch yielding the highest Pearson $r$ on the U-SNLI dev set is selected. We report results on both the U-SNLI dev and test sets for the selected model.

#### Hypothesis-only baselines

Owing to concerns raised about *annotation artifacts* in SNLI (Gururangan et al., 2018; Tsuchiya, 2018; Poliak et al., 2018), we include a *hypothesis-only baseline* for each of our settings (see Table 4), in which every premise is replaced by the empty string. These baseline systems achieve correlations around 40%, corroborating the finding of that line of work that SNLI contains hidden biases permitting prediction from the hypothesis sentence alone, even with no context from the premise. These baselines show the bias also exists in U-SNLI.

#### Main results

Results on the U-SNLI dataset can be found in Table 4. Training only on our annotated U-SNLI yields a reasonable 62.71% Pearson $r$ on test; this is consistently improved by pre-training on SNLI (under both pre-training settings, regression and ranking). Ranking consistently achieves higher correlation than regression: in the SNLI-only settings (without U-SNLI), switching from regression to ranking gains about 4% in Pearson correlation; in the U-SNLI fine-tuning scenarios, the switch gains about 0.6%.

Figure 8 illustrates the effect of fine-tuning with U-SNLI on model behavior. Before fine-tuning with our U-SNLI data (i.e., using SNLI alone), under the surrogate regression setting, the model's predictions concentrate on the three surrogate scalar values of the three SNLI classes (con / neu / ent). The learning-to-rank setting yields slightly more flexible probability assignments to premise-hypothesis pairs that also correlate better with elicited U-SNLI labels, as reflected in better Pearson scores. After fine-tuning on the U-SNLI training set, the model's predictions become smoother, as reflected in the superior Pearson correlation.

Note that the bottom-right corner of the heatmaps in Figure 8 (samples with high gold U-SNLI labels and high model predictions) shows high accuracy. This accords with the finding of Zhang et al. (2017) that ent pairs in SNLI tend to have probability close to 1.0, whereas the neu and con pairs exhibit a wider range of subjective probability values.

#### Errors

Table 5 presents a selection of samples from the U-SNLI dev set with among the largest gaps between the gold probability assessment and the output of our best BERT-based model. The model appears to have learned lexicon-level inference (e.g., *race cars* → *going fast*, while ignoring the crucial information *sits in the pits*), but fails to learn certain commonsense patterns (e.g., *riding amusement park ride* → *screaming*; *man and woman drinking at a bar* → *on a date*). These examples show that despite significant improvements from large-scale language model pre-training, commonsense reasoning and plausibility estimation remain unsolved.

#### Human performance

We elicit additional annotations on the U-SNLI dev set to establish the performance of a randomly sampled human on UNLI. We split the dev set into 3 parts, each labeled by annotators previously selected via the qualification test, ensuring each item was new to its annotator (the annotator had not provided a label used in the creation of dev). Scores were elicited for all premise-hypothesis pairs with no redundancy (one-way annotation). This setting approximates the performance of a randomly sampled human on U-SNLI, and is therefore a reasonable lower bound on the performance achievable by a dedicated, trained single human annotator.

These metrics are listed in Table 4. Our best models achieve a higher score than this human performance (Pearson $r$: 67.97% > 62.28%), demonstrating that they can achieve human-level inference on premise-hypothesis pairs sampled from SNLI.

#### Coherence

Since we define UNLI as a modification of RTE in terms of human responses to $h$ given $p$, we ask here whether judgments by humans, and separately by our system, are always coherent when the same premise is paired with different, mutually incompatible hypotheses. We consider two examples (see Table 6) pulled from SNLI train, selected by hand because the premise establishes the potential for an intuitively common-sense, *finitely enumerable* set of alternatives. Based on the premise we manually constructed alternatives to an existing hypothesis such that (1) they are logically mutually exclusive, and (2) one of the hypotheses must reasonably hold given the premise. Specifically, a *preteen* must have an age within a small enumerable range, and the most commonsense alternatives to *lunch* include *breakfast* and *dinner*. We distributed these constructed pairs into separate HITs, ensuring that no annotator viewed two related premise-hypothesis pairs at the same time, with 6-way redundancy (see Figure 9).

With respect to human judgments, we observe that the summed probability across the options exceeds 1.0 in both cases. That humans can be irrational in their probability assignments is well known, so this result is not unexpected: in UNLI we have embraced human judgments in the definition, taking seriously the phrasing of the original RTE task.

Regarding our best model’s predictions on these examples, we first observe that its scores also lead to a summed over-estimate, with a distribution of values strikingly similar to the median human response in the *barbecue* example. Second we observe a clear error in the *girl* example, where BERT plus subsequent (U)NLI exposure did not appear to provide a definition of the word *preteen*.

| Premise | Hypothesis | Alternatives |
|---|---|---|
| Preteen girl with blond-hair plays with bubbles near a vendor stall in a mall courtyard. | The girl is ten. | *ten* / *eleven* / *twelve* |
| Three young men standing in a field behind a barbecue smiling each giving the two handed thumbs up sign. | Three men are barbecuing lunch. | *breakfast* / *lunch* / *dinner* |

**Table 6:** Coherence-probe examples with mutually exclusive constructed alternatives.

## 8 Conclusion and Future Work

We proposed a new task, Uncertain NLI (UNLI), of directly predicting human likelihood judgments on NLI premise-hypothesis pairs. As a proof of concept we built the U-SNLI dataset, containing NLI pairs sampled from SNLI and annotated with subjective probability assessments in the form of a scalar in $[0, 1]$, instead of the commonly used 3-way con / neu / ent classification labels.

We demonstrated that (1) eliciting such data is feasible, and (2) the annotations can be used to improve a scalar regression model, built on recent contextualized encoders (BERT), beyond the information contained in existing categorical labels. Performance was on the level of humans, while still retaining non-human-like errors in some predictions.

We suggest that future resource creation in NLI shift to UNLI. Regarding what data to (re-)annotate: we chose SNLI as the basis for our proof of concept for the reasons described earlier, but various works have raised concerns about SNLI, such as the hypothesis-only artifacts referenced above. Zhang et al. (2017), concerned about the direct elicitation of hypothesis statements, proposed a procedure of automatically generating common-sense hypotheses from SNLI premises, followed by human filtering and labeling; such a process could be adapted to UNLI. Given the common-sense errors our model makes on U-SNLI, we anticipate such a dataset would prove a significant and interesting challenge.
