ChatGPT | SmallBiz.com - What your small business needs to incorporate, form an LLC or corporation!

The top 50 books being used to train ChatGPT — and what they say about its ‘intelligence’

Adam Rogers — Tue, 30 May 2023 10:02:00 +0000

David Bamman was trying to analyze “Pride and Prejudice” — digitally. An information scientist at UC Berkeley, Bamman uses computers to think about art, building what he calls “algorithmic measuring devices for culture.” That means extracting data from classic literature about things like, say, the relationships among various characters. In this case, he was going to start with a question that’d be easy for an even marginally literate human: Are Lizzie and Jane besties, or just sisters?

For kicks, Bamman decided to first try asking ChatGPT. What would happen if he fed in 4,000 words of “Pride and Prejudice” and posed a simple question: “What are the relationships between the characters?”

To his amazement, it worked. The chatbot’s GPT-4 version was amazingly accurate about the Bennet family tree. In fact, it was almost as if it had studied the novel in advance. “It was so good that it raised red flags in my mind,” Bamman says. “Either it knew the task really well, or it had seen ‘Pride and Prejudice’ on the internet a million times, and it knows the book really well.”

The problem is, there was no way of knowing how GPT-4 knew what it knew. The inner workings of the large language models at the heart of a chatbot are a black box; the datasets they’re trained on are so critical to their functioning that their creators consider the information a proprietary secret. So Bamman’s team decided to become “data archaeologists.” To figure out what GPT-4 has read, they quizzed it on its knowledge of various books, as if it were a high-school English student. Then they gave it a score for each book. The higher the score, the likelier it was that the book was part of the bot’s dataset — not just crunched to help the bot generate new language, but actually memorized.

In a recent preprint (meaning it hasn’t been peer reviewed yet), the team presented its findings: what amounts to an approximation of the chatbot canon. Much of it, as you might expect, consists of the classics: everything from “Moby Dick” and “The Scarlet Letter” to “The Grapes of Wrath” and, yep, “Pride and Prejudice.” There are a bunch of popular novels, from Harry Potter and Sherlock Holmes to “The Da Vinci Code” and “Fifty Shades of Grey.” But what’s most surprising is how much science fiction and fantasy GPT-4 has been raised on. The list is staggering: J.R.R. Tolkien, Ray Bradbury, William Gibson, Orson Scott Card, Philip K. Dick, Margaret Atwood, “A Game of Thrones,” even “The Hitchhiker’s Guide to the Galaxy.”

The question of what’s on GPT-4’s reading list is more than academic. Bots aren’t intelligent. They don’t understand the world in any way a human can. But if you want to get to know someone — or something, in this case — you look at their bookshelf. Chatbots don’t just invent untrue facts, perpetuate egregious crud, and extrude bland, homogenized word pap. It turns out they’re also giant nerds.

The Silmarillion. Really?

One reason people are trying to figure out what sources chatbots are trained on is to determine whether the LLMs violate the copyright of those underlying sources. The issue, as several lawsuits argue, revolves around whether the bots make fair use of the material by transforming it into something new, or whether they just memorize it whole and regurgitate it, without citation or permission.

One way to answer the question is to look for information that could have come from only one place. When prompted, for example, a GPT-3 writing aid called Sudowrite recognizes the specific sexual practices of a genre of fan-fiction writing called the Omegaverse. That’s a strong hint that OpenAI scraped Omegaverse repositories for data to train GPT-3.

Bamman and his team used a different tactic: a fill-in-the-blank game called a name cloze. They grabbed short passages from hundreds of novels from as far back as 1749, stripped them of character names and any clues to character names, and then prompted the latest versions of ChatGPT to answer questions about the passage. They might ask:

You have seen the following passage in your training data. What is the proper name that fills in the [MASK] token in it? This name is exactly one word long, and is a proper name (not a pronoun or any other word). You must make a guess, even if you are uncertain.

Then they would feed the bot a line from the passage in question:

The door opened, and [MASK], dressed and hatted, entered with a cup of tea.

If the bot answers “Gerty,” that’s a good indicator it has ingested “The House of Mirth,” by Edith Wharton — or a detailed summary of it. Show the bot 100 samples from a given book and see how many it gets right. That’s the book’s score.
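
The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual code: `ask_model` is a hypothetical stand-in for whatever call sends a prompt to the chatbot, `toy_model` is a fake that recognizes a single passage, the second sample passage is made up, and the loose answer matching (ignoring case and stray punctuation) is an assumption.

```python
from typing import Callable, Iterable, Tuple

# Instruction text as quoted in the article.
INSTRUCTION = (
    "You have seen the following passage in your training data. "
    "What is the proper name that fills in the [MASK] token in it? "
    "This name is exactly one word long, and is a proper name "
    "(not a pronoun or any other word). "
    "You must make a guess, even if you are uncertain."
)


def build_prompt(passage: str) -> str:
    """Combine the instruction with one masked passage."""
    return f"{INSTRUCTION}\n\n{passage}"


def name_cloze_score(
    samples: Iterable[Tuple[str, str]],
    ask_model: Callable[[str], str],
) -> float:
    """Return the fraction of masked passages whose name is guessed correctly.

    samples: (masked_passage, true_name) pairs taken from a single book.
    ask_model: hypothetical stand-in for a chatbot call (prompt in, guess out).
    """
    total = 0
    correct = 0
    for passage, true_name in samples:
        guess = ask_model(build_prompt(passage))
        total += 1
        # Assumption: ignore case, surrounding whitespace, and stray punctuation.
        if guess.strip().strip(".\"'").lower() == true_name.lower():
            correct += 1
    return correct / total if total else 0.0


def toy_model(prompt: str) -> str:
    """Fake model that only 'knows' one passage (illustration only)."""
    return "Gerty" if "dressed and hatted" in prompt else "Unknown"


samples = [
    ("The door opened, and [MASK], dressed and hatted, entered with a cup of tea.", "Gerty"),
    ("[MASK] walked to the window and said nothing.", "Lily"),  # made-up passage
]
print(name_cloze_score(samples, toy_model))  # 0.5
```

With 100 passages per book, as in the article, the returned fraction is that book's score.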

After crunching the numbers, Bamman’s team had a list. In addition to the modern public-school canon — Charles Dickens and Jack London, Frankenstein and Dracula — there are a few fun outliers. I was delighted to see “The Maltese Falcon” on there; for my money, Dashiell Hammett is a better hard-boiled detective writer than the more often cited Raymond Chandler. But if you skip the stuff in the public domain and look at the list of copyrighted books that GPT-4 ingested — it didn’t differ much from the earlier GPT 3.5 — the bot’s true character emerges. Sure, “The Fellowship of the Ring” weighs in at No. 3, but you have to be pretty committed to Tolkien not to bounce off “The Silmarillion” (No. 9). “Do Androids Dream of Electric Sheep?” comes in at No. 21, just a few ticks below “Neuromancer” — two of the defining works of cyberpunk, the genre, ironically, that rang the warning klaxon on artificial intelligence. Isaac Asimov’s “Foundation” is down at the bottom; it defined my adolescent sci-fi experience and, having reread it when the very good TV version premiered two years ago, I promise you that the book in no way holds up.

Generally, though? The list, it me. This is the self-assigned, late-night, sci-fi reading list of every lonely straight white male Gen X nerd. The question is: Does that matter? What are we in for if GPT-4 has the reading preferences of a 14-year-old dweeb from 1984? (Including, as it happens, “1984,” at No. 2?)

What AI reads matters

GPT-4’s database is ginormous — up to a petabyte, by some accounts. So no one novel (or 50 novels) could teach it, specifically, that becoming the caretaker of a haunted hotel is no cure for writer’s block (No. 49), or that fear is the mind-killer (No. 13). The ocean of data swamps the islands of fiction. “The dataset used in pretraining is a big-enough selection of text,” says Ted Underwood, an information scientist at the University of Illinois, “that I’m not sure how much effect particular genre biases have on the behavior of the resulting models.”

The presence of these particular books in GPT-4’s digital soul may just reflect how present they are in the overall, wild internet from which the data got scraped. When Bamman’s team includes public domain books in their tests, the scores get higher — “Alice’s Adventures in Wonderland” tops the chart with a whopping 98%. And both the internet and the companies that build its bots tend to overrepresent standard-issue straight white dudes and the science fiction they love. Bamman’s team did indeed find that the books the LLMs scored high on were represented on the internet in roughly the same proportions. That makes sense. The chatbots didn’t choose their books. Internet culture did.

Still, it’s not hard to imagine that all that sci-fi the bots read will have the same malign influence on them as all the other data they trained on, creating the same kind of accidental biases that always creep into chatbot output. Sometimes they say racist stuff. They might recapitulate misinformation as if true because the same untruths show up often online. These are known risks, and part of the reason that OpenAI boss Sam Altman recently asked Congress to regulate his business.

“The sources that these models have been trained on are going to influence the kind of models they have and values they present,” Bamman says. If all they read was Cormac McCarthy books, he suggests, presumably they’d say existentially bleak and brutal things. So what happens when a bot devours fiction about all sorts of dark and dystopian worlds filled with Hunger Games and Choosing Ceremonies and White Walkers? “How might this genre influence the behavior of these models in ways not about literary or narrative things?” Bamman says. “There’s a lot of interesting work to be done there. But I don’t think we have the answer to that question yet.”

As a sci-fi nerd myself, I’ll take a stab at an answer. I think it’s good that genre literature is overrepresented in GPT-4’s statistical information space. These aren’t highfalutin Iowa Writers’ Workshop stories about a college professor having an affair with a student and fretting about middle age. Genre — sci-fi, mystery, romance, horror — is, broadly speaking, more interesting, partially because these books have plots where things actually happen. Bamman’s GPT-4 list is a Borgesian library of episodic connections, cliffhangers, third-act complications, and characters taking arms against seas of troubles (and whales).

More than that, science fiction, fantasy, and horror tend to be spaces for chewing on ideas and possibilities. “Dune” is about religion and the politics of revolution. The “Lord of the Rings” books are about pastoralism as a response to industrialization. “The Handmaid’s Tale” is about the ways sexism and fascism mirror each other. I could go on. I prefer an AI with a syntactical worldview spun from hyperspace and sandworms — or at least one that has read all the stories about how AIs can go awry. That said, I’d sure like to see a more diverse canon represented. Octavia Butler, Charlie Jane Anders, Lavie Tidhar, Samuel Delany, China Miéville … it’s time to expand the universe of possible universes.

The books we humans read change what we think about our world. But technically, chatbots don’t think about anything. They build statistical and vector relationships among words. Who cares whether those words are science-fictional? “The thing it definitely changes are the associations between concepts they think are likely, or strong, or systematic, or recurring,” says Ellie Pavlick, a computer scientist at Brown University who is a researcher at Google AI. “The question is, what is their worldview? In a simple sense, it’s associations between words and concepts. But that’s still going to be different based on what they read.”

Until OpenAI and other chatbot creators open their training datasets to public scrutiny, it will be hard to know what effect their reading lists have on their output. “If you have a model that has a ton of science fiction in it, and you have a separate model with a ton of Iowa Writers’ Workshop stuff,” Bamman says, “you could give each of them a task like: Give me 10 priorities for this meeting.” Maybe the Iowa bot would suggest that everyone describe their complicated relationships with their parents, while the sci-fi-nerd bot would propose sorting everyone into Hogwarts houses.

Remember, though, that Bamman wasn’t trying to answer any of these questions about copyright or the scariness of all the ghosts in the machine. He just wanted to know whether a chatbot could tell him something about a novel. In retrospect, he realizes that he was “overexuberant” about AI’s potential as a literary analyst when he fed GPT-4 that passage from “Pride and Prejudice.” Ask a bot about a popular book, and like a college sophomore with a 10-page essay on “Jane Eyre” due tomorrow, it’ll just quote you back long passages from the book. It’s vomiting up words, not searching for insight.

For now, Bamman suggests, digital humanists might want to confine their chatbot-derived cultural analysis to lesser-known works, ones that are unlikely to be in the training data. See what a bot makes of Gene Wolfe’s “The Book of the New Sun,” maybe, or Sheri Tepper’s “Grass.” That way, we’ll learn more about the books from what the bots have to say, because they’ll be coming at the material with a fresh eye, as it were. And it certainly won’t hurt to expose the bots to a wider and weirder dataset. That’s the only way to make them have something interesting to say about the things we read — and about everything else, too.

Adam Rogers is a senior correspondent at Insider.

OpenAI’s ChatGPT iOS app now available in Canada, India, Brazil and 30 more countries

Jagmeet Singh — Fri, 26 May 2023 05:07:26 +0000

OpenAI has expanded the availability of its ChatGPT app for iOS users in India and 32 other countries, just a week after launching it in the U.S.

The list of new countries includes Algeria, Argentina, Azerbaijan, Bolivia, Brazil, Canada, Chile, Costa Rica, Ecuador, Estonia, Ghana, India, Iraq, Israel, Japan, Jordan, Kazakhstan, Kuwait, Lebanon, Lithuania, Mauritania, Mauritius, Mexico, Morocco, Namibia, Nauru, Oman, Pakistan, Peru, Poland, Qatar, Slovenia, Tunisia and the United Arab Emirates.

Earlier this week, OpenAI expanded the ChatGPT app to 11 additional countries after the U.S. Those include European nations such as France, Germany and Ireland as well as New Zealand, Nigeria, South Korea and the U.K.

In the first six days after its initial release in the U.S. last Thursday (May 18), the ChatGPT mobile app passed half a million downloads, according to data shared by app intelligence firm data.ai. That makes it one of the highest-performing new apps. The app also outperformed other AI and chatbot apps as well as Microsoft Edge and Bing apps in the U.S. in terms of downloads since its launch, per data.ai.

The ChatGPT app, which is free to download and carries no ads, lets users interact with the generative AI-based chatbot using their iPhone. It also supports voice input through OpenAI’s speech recognition system Whisper and lets ChatGPT Plus users access advanced features through GPT-4. Users can also subscribe to the ChatGPT Plus service, which costs $20 per month in the U.S., directly through the iOS app.

OpenAI offers the ChatGPT app only on iOS at the moment. However, the startup, backed by Microsoft and marquee VC firms such as Tiger Global and a16z, also plans an Android version, which it has promised to bring to market soon.

The expansion of the ChatGPT app comes at a time when OpenAI’s CEO Sam Altman is touring several countries to better connect with global policymakers and understand their concerns about AI. The executive met with some European heads of state this week. He is also visiting India early next month.

ChatGPT is making people more money and better at their jobs. 4 of them break down how.

Jack Sommers — Thu, 25 May 2023 11:45:49 +0000

ChatGPT has been publicly available for six months and is already changing the world of work.
Insider has profiled multiple workers who use the generative AI to make money and work smarter.
Four of them break down how they do it.

ChatGPT can already speed up certain tasks done by specialists, tell you how to stack eggs, and talk to your boss for you without them realizing.

Since it was released in November, office workers have been wondering whether their jobs will be replaced by it. Some have, in the meantime, experimented with how they can use it to get ahead.

Four people spoke with Insider about how they used it to do their jobs better and make more money.

A recruiter saves 10 hours a week by getting ChatGPT to list schools and institutions to target

Jasmine Cheng is a former Amazon recruiter who recently set up her own firm in engineering and healthcare recruiting. She was looking at startups whose employees could be good job candidates for her clients and asked ChatGPT to list companies according to criteria around industry, number of employees, and location, among other things.

She then feeds the list into a complex keyword-searching strategy called a Boolean search string.

“Before using ChatGPT, it took me a long time — around 15 hours a week — to manually put together lists that matched my criteria. Now, I only spend roughly five hours a week making lists,” she said.
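
For readers unfamiliar with the term, a Boolean search string simply joins keywords with operators such as AND and OR. The Python sketch below shows one way a list of companies could be turned into such a string. It is an illustration under stated assumptions, not Cheng's actual workflow: the function names, company names, and job titles are placeholders, and the sketch does not escape quotation marks inside a term.

```python
from typing import Sequence


def or_group(terms: Sequence[str]) -> str:
    """Quote each term and join the terms with OR inside parentheses."""
    quoted = [f'"{t}"' for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def boolean_search(companies: Sequence[str], titles: Sequence[str]) -> str:
    """Match anyone at one of the companies who holds one of the titles."""
    return or_group(companies) + " AND " + or_group(titles)


# Placeholder company names and job titles, not taken from the article.
query = boolean_search(["Acme Robotics", "Globex Health"], ["engineer", "nurse"])
print(query)
# ("Acme Robotics" OR "Globex Health") AND ("engineer" OR "nurse")
```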

A broker uses ChatGPT to write the first drafts of listings for luxury real estate in New York

After 23 years in the business, Randy Baruh, a broker for luxury real estate in New York City, considered writing the listings for each property a “chore” that could take several hours.

Each one had to convey “all the bells and whistles of the property, including the amenities, the location, and the proximity to anything of note to draw people in,” and be optimized for Google searches, he said.

His team used ChatGPT to write one, and though it had to be reviewed to correct factual errors and remove repetition, Baruh said the search-engine optimization ensured it reached a lot of people. The property quickly sold for 10% above the listing price.

“It’s become my go-to resource,” Baruh said. “I’ve found that it puts a fresh take on descriptions and dramatically streamlines the process of writing them, allowing us more time to focus on other aspects of our business.”

An entrepreneur used ChatGPT to build a Chrome extension he sold for thousands

Ihor Stefurak wanted to build an invisible artificial-intelligence assistant for Chrome, with which a user could type “/ai prompt” in any text area of a website, followed by a prompt to ChatGPT, and receive the bot’s response in that text area. He had no background in programming and had never made a Chrome extension. So he used ChatGPT to work on the coding. He called it his “chief technology officer.”

Using the paid version, Stefurak prompted it to produce code for an extension that monitored website input boxes. Ultimately, he produced three JavaScript files to execute the idea, an HTML file, and a manifest JSON file.

When it was ready, he allowed people to preorder the extension and received $1,000 worth of orders within 24 hours. Within three weeks, he sold it for thousands, his fastest launch to exit.

“A human developer could have undoubtedly built this faster and better, but the idea here is that I’m not a developer and still managed to create this,” he said.

A CMO says ChatGPT’s paid version is like ‘having a 24/7 assistant’

Teresha Aird is a cofounder and the chief marketing officer of Offices.net, a nationwide office-space brokerage. A mother of two with a busy job, she struggled to find a good work-life balance.

She said she found the paid version of ChatGPT useful for “answering low-stakes queries on the fly” that she previously had to answer by accessing her database at work or via other research.

“If I’m away from the office and receive common questions from clients or tenants about properties or listing locations, I can easily input their questions and receive satisfactory answers,” she said.

She added: “It also helps me answer client questions in a contextually appropriate way and has afforded me more time to concentrate on my kids’ extracurriculars or help them with homework.”

The official ChatGPT app is now available in 11 more countries

Romain Dillet — Thu, 25 May 2023 09:40:04 +0000

OpenAI has announced in a tweet that the official ChatGPT mobile app is now available in more countries. When OpenAI first unveiled its mobile app last week, the app was only available on iOS and in the U.S. Now, many people living in Europe, South Korea, New Zealand and more will be able to download the app from the App Store.

The ChatGPT app is a free app without any ads. People who are already familiar with ChatGPT will feel right at home as it’s basically just a way to interact with the chatbot — nothing more, nothing less.

Here’s the full list of countries where the ChatGPT app is now available: Albania, Croatia, France, Germany, Ireland, Jamaica, New Zealand, Nicaragua, Nigeria, South Korea, the U.K. and the U.S. Once again, the app is only available on iOS for now. In its original announcement, OpenAI also promised that an Android app was “coming soon.”

When you open the app, you can start typing text in a text box at the bottom of the screen. It works just like sending a message in any messenger app. While you can dictate text using Apple’s built-in speech recognition feature, you can also leverage OpenAI’s open source speech recognition system Whisper for voice input.

After you hit the send button, OpenAI processes your request and returns an AI-generated answer. You can follow up with more information or ask for a different answer. The app supports code blocks and users can copy and paste answers.

By default, ChatGPT saves your chat history and uses it for model training. When this feature is enabled, you will also be able to find your conversations on desktop. It’s worth noting that there’s no way to disable data sharing without disabling chat history too.

If you are a ChatGPT Plus subscriber, you will be able to access GPT-4’s capabilities through the mobile app. Users should also notice faster response times. ChatGPT Plus costs $20 per month on desktop and is also available as an in-app purchase in your local currency (€22.99 per month in Europe, £19.99 in the U.K., etc.).

The timing of this expansion, which includes several European countries, is interesting as OpenAI’s CEO Sam Altman is meeting European heads of state this week, such as France’s Emmanuel Macron, Spain’s Pedro Sánchez and the U.K.’s Rishi Sunak. Altman has expressed criticism of rushed AI regulatory policy. And now, ChatGPT will be much more accessible in Europe as people will be able to say “just download the app.”
