[Submitted on 25 Oct 2021]
Abstract: The One Billion Word Benchmark is a dataset derived from the WMT 2011 News Crawl, frequently used to measure language modeling ability in natural language processing. We train models solely on Common Crawl web scrapes partitioned by year, and show that they perform worse on this task over time due to distributional shift. Analysis of this corpus reveals that it contains several examples of harmful text, as well as outdated references to current events. We suggest that the temporal nature of news and its distribution shift over time makes it poorly suited for measuring language modeling ability, and discuss potential impact and considerations for researchers building language models and evaluation datasets.
From: Helen Ngo
Mon, 25 Oct 2021 02:41:27 UTC (238 KB)