Joshua Benton of Nieman Lab writes about BloombergGPT, which aims to be a domain-specific artificial intelligence for business news.

Benton writes, “How big is BloombergGPT? Well, the company says it was trained on a corpus of more than 700 billion tokens (or word fragments). For context, GPT-3, released in 2020, was trained on about 500 billion. (OpenAI has declined to reveal any equivalent number for GPT-4, the successor released last month, citing ‘the competitive landscape.’)

“What’s in all that training data? Of the 700 billion-plus tokens, 363 billion are taken from Bloomberg’s own financial data, the sort of information that powers its terminals — ‘the largest domain-specific dataset yet’ constructed, it says. Another 345 billion tokens come from ‘general purpose datasets’ obtained from elsewhere.

“The company-specific data, named FinPile, consists of ‘a range of English financial documents including news, filings, press releases, web-scraped financial documents, and social media drawn from the Bloomberg archives.’ So if you’ve read a Bloomberg Businessweek story in the past few years, it’s in there. So are SEC filings, Bloomberg TV transcripts, Fed data, and ‘other data relevant to the financial markets.’”

