The Magic of Text-to-Video AI (AI魔法:文字生成影片), Studio Classroom (空中英語教室)
The Magic of Text-to-Video AI, Part 1
Studio Classroom, September 6, 2024 (a ChatGPT-generated Chinese translation follows the transcript)
Hello friends, and welcome to Studio Classroom. We are so glad you’ve joined us for our lesson today. My name is Ann Marie.

And my name is John. Friends, today is a great day to learn something new.

It certainly is, so let’s do that together now. Friends, you may know Teacher John as Ernest Finder from the Fun Fact segment, and I wanted to let you all know a little more about him. John actually used to work in Hollywood.

That’s right, I did. In Los Angeles. Hollywood is where a lot of movies are made, you know.

Today we’re talking about a movie star briefly, right?

Yeah, we are. We’re talking about Will Smith. Have you seen a lot of different celebrities before? Did you ever see Will Smith?

No, I never saw Will Smith, but one time I was working as a waiter in Hollywood and I saw his kids. I think Jaden and Willow Smith.

They’re pretty famous too.

Yeah, so there are a lot of celebrities out there. But what happens when AI, or artificial intelligence, makes fake celebrities?

Well, we’re going to learn about that and the magic of text-to-video. Sounds pretty exciting.

It does, so let’s get into our first reading of the day: "The Magic of Text-to-Video."

Another great leap for creative AI! In March of 2023, a notoriously bizarre video of Will Smith eating spaghetti was created using an early text-to-video AI tool called ModelScope and released on Reddit, a popular content-sharing platform. The technology, while impressive, was deeply limited in its ability to accurately represent the text prompt. Some users even described the video as nightmarish. Little did they know that a year later, text-to-video AI would be creating videos so realistic that they would blur the line between the digital and the real.

Hi everyone, welcome to Language Lab. I’m Jack.

Let’s start with the adjective "bizarre," which means strange or odd. For example, "The artist’s exhibition was filled with bizarre paintings and sculptures." Or, "The kids came upon a bizarre scene in the park where a group of people were dressed as characters from children’s books." Or, "Seth’s dream was so bizarre that he still felt uneasy long after he woke up."

Next, let’s look at the word "blur." As a verb, it means to make or become unclear. For example, "In the dim light of the sunset, the line between the sea and the sky began to blur." Or, "As Martha thought about her sick cat, tears blurred her vision." "Blur" can also be a noun, meaning something that cannot be seen clearly. For example, "The photo caught the runner crossing the finish line and turned him into a blur against the cheering crowd."

Well friends, we have another AI article. AI is really changing our world, isn’t it? The way that we work, the way that we live, and the way that we consume media—these are all things that have been affected by AI. So let’s get into our article today. How do we start here, John?

Well, we talk about another great leap for creative AI. If something is a leap, it means a move forward. Leap literally just means jump. But let’s review what AI means. A stands for artificial, right? And I stands for intelligence. If we use intelligence as a noun like that, it means that it’s actually its own thing. You could describe a person or some space creature as an intelligence. So here we talk about computer systems being an intelligence, and it’s kind of its own noun.

That’s a really good point. Well, we get into our text here, friends, and we read, "In March of 2023, a notoriously bizarre video of Will Smith eating spaghetti was created using an early text-to-video AI tool called ModelScope." Let’s stop there for a minute. There are a few things we need to look at here. First of all, I want to just say I saw this video. It’s been over a year now. Yeah, it’s really bizarre.

That’s right, yeah. It’s kind of jerky and weird, and the face kind of moves in an unnatural way. But it’s definitely Will Smith. You see it and you’re like, "This looks like Will Smith," but it’s not going to take Will Smith’s job yet.

So, that’s what we’ll talk about. By the way, Ann Marie, we’ve got that word "notoriously." How can I use that in a sentence?

Well, "notoriously" just means something is famous or well-known, usually for something bad. So, for example, you can say a city is notorious for its bad traffic. Bad traffic isn’t a good thing. That’s right. So you could say that it’s known for its bad traffic, but if you use the word "notoriously," you’re saying that everybody knows, and it’s not a good thing. In my house, I’m notorious for my bad cooking.

So, you don’t want to be notorious. And this video was notorious because it was bizarre, which we talked about: it looks weird. And it was created in a special way using an early text-to-video AI tool. So that’s what we’re talking about today: text-to-video. This video AI tool was called ModelScope and it was released on Reddit. "Released" is a very useful verb. It means to make something, like a song or a movie, available to the public. So you could say, "The singer released her new song and everyone loved it."

Okay, let’s move on to the next sentence. "The technology, while impressive, was deeply limited in its ability to accurately represent the text prompt." Okay, friends, "text prompt" is a key term for this text, so write this down. Here’s the definition: A text prompt is a specific keyword or sentence that tells AI what you want it to do. So this is basically the input. This is what you’re inputting into the AI in order to get the output that you want.

Okay, and at first, this input creates these nightmarish videos. A nightmare is a scary dream. So, they don’t really look real. But little did they know, we read that a year later, text-to-video AI would be creating videos so realistic that they would blur the line between the digital and the real.

Okay, so digital is something that’s made on a computer, and of course, real is real life. So now the line is getting blurred. Well, there’s a lot more to learn about this. Let’s go to the second section of our reading and learn about AI together.

"The Magic of Text-to-Video."

If you’ve ever used a text-to-image generative AI tool like DALL-E, you’ll see that text-to-video tools work in a similar way. The user provides a prompt like "cat swimming in a fish tank," and the tool creates a video based on everything it knows about cats, water, fish, and the physics of how they might interact. The tool makes use of what are known as visual patches—building blocks of data that help the AI understand how everything in the scene should interact and progress frame by frame.

Next, let’s look at the phrase "make use of," which means to use something or take advantage of it. For example, “To improve her physical health, Cheryl decided to make use of the new gym that had opened up in her neighborhood.” Or, “The community decided to make use of an empty lot and turn it into a neighborhood garden.” Here is one more: “In a new recipe, the chef made use of the spices she had brought back from South America.”

Thank you so much, Jack. Well, we read on here, friends. "If you’ve ever used a text-to-image generative AI tool like DALL-E, you’ll see that text-to-video tools work in a similar way."

Okay, we have a term we need to define here: generative AI. Here’s what it is: It’s a deep learning model that can generate high-quality text, images, and other content. So it’s a generator—you put input into it, and it generates some type of output. So it’s kind of like a factory for pictures or videos.

Cool. Well, this technology is growing very quickly, and we are given an example here in our reading. "If you have ever used a text-to-image generative AI tool like DALL-E, you’ll see that text-to-video tools work in a similar way." And how do they work? Well, the user provides a prompt like "a cat swimming in a fish tank." Okay, so you type in "cat swimming in a fish tank," and then what happens?

Well, the tool creates a video based on everything it knows about cats, water, fish, and the physics of how they might interact.

So, what is physics? Physics is the branch of science concerned with the nature and properties of matter and energy. It’s a very scientific term, which means that the AI tool has to understand how physics works and how objects might interact with each other.

There’s another word related to physics: physical. If you say something is physical, it means it has a tangible presence—you can touch it. Physics deals with how things touch and work together, and it includes concepts like magnets. Computers need to understand how the real world works if they’re going to create an accurate picture of it.

I really like that you used the phrase "to work with each other" or "interact with each other" because we see the word interact in this sentence. If you interact with something, it means that you have a relationship with it or come into contact with it. For example, if two people work in the same office but don’t interact much, it means they’re both there but don’t talk very often. We often use the word interact to mean talk, so Ann Marie and I interact with you guys and we learn English together.

Let’s keep reading. It says in our reading, "The tool makes use of what are known as visual patches," and we get a definition right there. What are visual patches? They are building blocks of data that help the AI understand how everything in the scene should interact and progress frame by frame.

A building block is a basic unit from which something is built. We know about building blocks as a toy; many children play with wooden blocks. Even though it’s a very simple toy, you can create many things with it. Similarly, in our article, building blocks refer to basic units used to construct something. For example, the building blocks of our bodies are cells. Or the building blocks of language are vocabulary and grammar. Keep adding building blocks to your English vocabulary, and soon you’ll be able to say anything.

We have a few interesting phrases here at the end: interact, progress. Interact means to act together, and progress means to go forward. We’ll need to progress in our lesson here. Now, let’s move on to our Info Cloud.

Hello everyone, welcome to Info Cloud. In this day and age, people receive more information than ever before. Do you think that’s a good thing or a bad thing, Rex?

It’s probably good and bad, but it is definitely getting harder to separate fact from fiction.

Indeed. That’s an excellent way to phrase it—to separate fact from fiction means to figure out what is true or real and what is not. With so much social media and so many sources of information, it is now much harder to know who is telling the truth than it was in the past. Facts are things that are true or things that actually happened, while fiction describes stories that are made up by writers and are not true. It can be disastrous when people view fact as fiction and fiction as fact. That’s why it is very important to slow down and analyze the information at hand with a clear mind so we can separate fact from fiction.

In the age of information overload, there are too many true and false messages coming at us. Therefore, we need to learn how to separate fact from fiction. Fact is the reality, while fiction is made-up stories. To separate fact from fiction is to distinguish between what is real and what is imaginary. It is very difficult to separate fact from fiction in this age. This is today’s Info Cloud. See you next time.

"The Magic of Text-to-Video."

Text-to-video tools like OpenAI’s Sora are based on a diffusion system. These tools are trained to recognize objects so they can refine images by filtering out incorrect visual patches. This process could be compared to a worker melting raw gold multiple times to remove all its flaws.

Finally, let’s look at the word filter. It can be used as both a verb and a noun. As a verb, one meaning is to gradually appear or seep in. For example, “The photographer filtered the light through a piece of light fabric to create a soft effect.” Filter also means to clean or purify. For instance, “You need to filter the water here before you drink it as it is not pure,” or “Walt replaced the old water filter to make sure his drinking water was clean and pure.”

In this next sentence in our lesson, we see the phrase “based on.” This phrase is marked earlier in your lesson as one of those focus phrases, which are great for you to learn because we use them a lot in conversation. We read here, “Tools that are based on a diffusion system.” If something is based on something else, it means it is grounded in or starts from something else. For example, a movie could be based on a book. Sometimes the movie might not be very similar to the book, but it takes parts of the story from it.

So, we’re talking about OpenAI’s Sora. These tools are trained to recognize objects so they can refine images. We need to stop there again to talk about some of these terms. The word trained means to be taught a set of skills and to practice them. It’s not just a one-time thing but something that happens repeatedly. For instance, puppies are trained to behave, and AI systems are also trained to improve their performance.

To refine something means to make it better or improve it. Normally, "re-" means again, and "fine" means to make something better. So refining images involves improving them by removing incorrect visual patches. Incorrect visual patches are just the parts of the picture that don’t work correctly. These machines learn what a good picture is by filtering out these incorrect parts.

Returning to our example of the video of Will Smith eating spaghetti, it’s hard to pinpoint exactly what’s wrong with it, but it’s just not right. The idea is to take those parts that aren’t quite right and refine them.

We have another sentence to look at: "This process could be compared to a worker melting raw gold multiple times in order to remove all its flaws." When we talk about raw gold, we mean it’s not yet processed or refined. Raw gold often contains other metals, so you melt it down several times until only the pure gold remains. It’s the same with images—the AI runs them through its process repeatedly to get a better result.
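
For readers who like to see an idea in code, here is a minimal Python sketch of the "melt the gold again and again" idea. It is only a toy and not how Sora actually works: the function names `refine` and `mean_abs_error`, the `rate` and `steps` parameters, and the stand-in "model" that already knows the clean answer are all assumptions made up for illustration. A real diffusion system uses a trained neural network to make that estimate.

```python
import random


def refine(noisy, predict_clean, steps=10, rate=0.5):
    """Repeatedly nudge a noisy signal toward an estimate of the clean one.

    predict_clean is a hypothetical stand-in for a trained model: it takes
    the current values and returns its best guess at the clean values.
    """
    current = list(noisy)
    for _ in range(steps):
        estimate = predict_clean(current)
        # Move each value part of the way toward the estimate (one "melt").
        current = [c + rate * (e - c) for c, e in zip(current, estimate)]
    return current


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [0.0, 0.25, 0.5, 0.75, 1.0]          # the "pure gold"
    noisy = [v + rng.gauss(0, 1) for v in clean]  # the "raw gold"

    # Assumption: a perfect oracle replaces the trained network here.
    def oracle(_current):
        return clean

    refined = refine(noisy, oracle, steps=10)
    print("error before:", mean_abs_error(noisy, clean))
    print("error after: ", mean_abs_error(refined, clean))
```

Each pass removes only part of the flaw, so the result improves gradually over many passes, which is the point of the gold comparison in the reading.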

Alright, friends, we’ll be right back in just a moment after today’s fun fact.

Hello, fact friends! I’m Detective Ernest Finder, and I have a fun fact for you. Did you know the name Sora means “sky” in Japanese? It’s true! And "the sky’s the limit" means there is endless possibility. Sometimes we say "the sky’s the limit" for AI, but be careful. That’s today’s fun fact.

Friends, it’s the end of our lesson, and you know what that means—it’s time for a quiz. Are you ready?

Here we go: Fill in the blank. I will give you four choices. "OpenAI’s Sora is based ___ a diffusion system." Is it:
based on a diffusion system,
based in a diffusion system,
based with a diffusion system,
or based by a diffusion system?

Wow, this is a hard one. I think the base is like the bottom or starting point, so here I think it’s based on it, right? Because everything is built up after that.

You’ve got it! The correct preposition is "on."

Well, friends, we’re not done talking about text-to-video AI. This is a really fascinating topic, and I know I’m learning a lot. We hope you’ll come back and learn with us right here on Studio Classroom.

大家好,歡迎來到《Studio Classroom》,我們很高興你們今天來參加我們的課程。我叫 Ann Marie。我的名字是 John。朋友們,今天是一個學習新知識的好日子,的確如此,所以我們一起來學習吧。
朋友們,你們可能知道 John 老師是《Fun Fact》中的 Ernest Finder。我想讓你們了解更多關於他的事。John 其實曾經在好萊塢工作過。沒錯,我曾在洛杉磯工作過。好萊塢是很多電影製作的地方。今天我們會簡單談論一位電影明星,對吧?對,我們要談的是 Will Smith。你曾經見過很多不同的明星,是否見過 Will Smith?
不,我沒有見過 Will Smith。不過有一次我在好萊塢當服務生時,我見過他的孩子,我想是 Jaden 和 Willow Smith。他們也很有名。對,明星真的是很多,那如果人工智慧或 AI 創造出假明星會怎樣呢?我們將學習這個問題,以及文字轉視頻的魔力,聽起來很有趣,對吧?
確實如此,那我們來看看今天的第一篇文章:《文字轉視頻的魔力》。2023 年 3 月,利用一種早期的文字轉視頻 AI 工具 ModelScope,創造了一段著名的 Will Smith 吃義大利麵的奇異視頻,並在 Reddit 這個受歡迎的內容分享平台上發布。雖然這項技術令人印象深刻,但在準確呈現文本提示方面仍然非常有限。有些用戶甚至形容這段視頻為夢魘。
但沒過多久,一年後,文字轉視頻 AI 創造出來的視頻將變得如此逼真,以至於模糊了數位與現實之間的界線。

大家好,歡迎來到語言實驗室,我是 Jack。
我們先來看「bizarre」這個形容詞,它的意思是奇異的或古怪的。例如,“The artist’s exhibition was filled with bizarre paintings and sculptures”(這位藝術家的展覽充滿了怪異的畫作和雕塑),或是 “The kids came upon a bizarre scene in the park where a group of people were dressed as characters from children’s books”(孩子們在公園裡看見了一個奇怪的場景,一群人打扮成童書中的人物)。
再來,我們看看「Blur」這個詞。當動詞使用時,它的意思是使模糊,例如,“In the dim light of the sunset, the line between the sea and the sky began to blur”(在夕陽的微光中,海天之間的界線開始變得模糊)。或者,“As Martha thought about her sick cat, tears blurred her vision”(當 Martha 想起她生病的貓時,眼淚模糊了她的視線)。當「blur」作為名詞時,它的意思是模糊的東西,例如,“The photo caught the runner crossing the finish line and turned him into a blur against the cheering crowd”(這張照片捕捉到跑者衝過終點線的瞬間模糊身影,背景是歡呼的人群)。

好了,朋友們,我們有另一篇 AI 文章,因為 AI 正在改變我們的世界,對吧?無論是工作、生活還是媒體消費,這些都是 AI 影響的領域。所以,讓我們進入今天的文章。John,我們從哪裡開始?
我們談談 AI 的另一個重大進展。如果某件事是「leap」,那意味著它向前邁進了一步。Leap 字面上的意思是跳躍。但是讓我們再回顧一下 AI 的意思,A 代表人工(Artificial),I 代表智慧(Intelligence)。如果我們把智慧(Intelligence)當作名詞使用,那意味著它實際上是一個獨立的東西。你可以形容一個人或外星生物為智慧。這裡我們談的是計算機系統作為一種智慧,它是一個獨立的名詞。
我們繼續閱讀文章。2023 年 3 月,使用一種早期的文字轉視頻 AI 工具 ModelScope 創造了著名的 Will Smith 吃義大利麵的視頻。讓我們停一下,首先,我要說我看過這段視頻,已經超過一年了。這真的很奇怪。對,這段視頻看起來有點僵硬和奇怪,臉部的動作也不自然,但確實是 Will Smith。你看到它會覺得這看起來像 Will Smith,但它還不能取代 Will Smith 的工作。順便提一下,Ann Marie,我們有那個詞「notoriously」,我怎麼用它來造句?
嗯,「notoriously」這個詞的意思是某事物因為某種不好的特質而出名或為人所知。例如,你可以說某城市因為交通擁擠而聲名狼藉。擁擠的交通並不是一件好事,對吧?所以你可以說這座城市以交通擁擠而聞名,但如果你使用「notoriously」這個詞,你是在說大家都知道這件事,並且這並不是一件好事。例如,在我家裡,我因為烹飪技術差而聲名狼藉,所以你不想讓自己成為一個「notorious」的人。這段視頻因為奇異而聲名狼藉,我們之前提到過,它看起來很奇怪,而且是用一種特殊的方式創造出來的,使用了一個早期的文字轉視頻 AI 工具。
所以這就是我們今天要討論的內容:文字轉視頻。這個視頻 AI 工具叫做 ModelScope,並且在 Reddit 上發布。發布(released)是一個很有用的動詞,它的意思是讓某些東西,如歌曲或電影,公眾可以使用。所以你可以說歌手發布了她的新歌,大家都很喜歡。
好,我們來看看下一句。雖然這項技術令人印象深刻,但在準確呈現文本提示的能力上仍然非常有限。朋友們,這是一個關鍵術語,請記下來:文本提示(text prompt)是指告訴 AI 你想要它做什麼的具體關鍵字或句子。這基本上是你輸入到 AI 中的內容,以便獲得你想要的輸出。
一開始,這些輸入會創造出一些像噩夢般的視頻。噩夢(nightmare)是一種可怕的夢,所以這些視頻看起來不太真實。但沒過多久,一年後,我們讀到文字轉視頻 AI 會創造出如此逼真的視頻,以至於會模糊數位和現實之間的界線。數位(digital)是指在電腦上製作的東西,當然,現實(real)是指現實生活中的東西,所以現在界線變得模糊了。還有很多東西需要了解,我們來看看我們閱讀的第二部分,一起了解 AI 的魔力吧。
如果你曾經使用過像 DALL-E 這樣的文字生成圖片 AI 工具,你會發現文字轉視頻工具的工作方式也很相似。使用者提供像「貓在魚缸裡游泳」這樣的提示,然後工具會根據它知道的關於貓、水、魚以及它們如何互動的物理知識來創造一段視頻。這個工具利用了所謂的「視覺片段」(visual patches),這些是數據的基礎建塊,幫助 AI 理解場景中的一切如何逐幀互動和發展。

接下來,我們來看「make use of」這個片語,它的意思是利用。例如,“To improve her physical health, Cheryl decided to make use of the new gym that had opened up in her neighborhood”(為了改善身體健康,Cheryl 決定利用附近新開的健身房)。或是 “The community decided to make use of an empty lot and turn it into a neighborhood garden”(社區決定利用一塊空地,將其改造成社區花園)。再看一句,“In a new recipe, the chef made use of the spices she had brought back from South America”(廚師在新食譜中使用了她從南美洲帶回來的香料)。

謝謝你,Jack。我們繼續閱讀。朋友們,如果你曾經使用過像 DALL-E 這樣的文字生成圖片 AI 工具,你會發現文字轉視頻工具的工作方式也很相似。那麼它們是如何工作的呢?使用者提供像「貓在魚缸裡游泳」這樣的提示。你輸入「貓在魚缸裡游泳」,然後會發生什麼?工具會根據它知道的關於貓、水、魚以及它們如何互動的物理知識來創造視頻。
那麼物理學是什麼呢,朋友們?物理學是科學的一個分支,專注於物質和能量的本質和屬性。這是一個非常科學的術語,對吧?這意味著 AI 工具必須了解物理學是如何運作的,以及物體如何相互作用。還有另一個和物理學相關的詞,就是「物理的」(physical)。如果你說某物是「物理的」,這意味著它有一個實體,你可以觸摸到它。因此,物理學是關於物體如何接觸和協作的,包括像磁鐵這樣的東西。不過,電腦需要知道現實世界是如何運作的,才能給你提供一個真實的圖片。
我很喜歡你使用「相互作用」(interact with each other)這個詞,因為我們在這句話中看到了「interact」這個詞。如果你和某物進行互動,這意味著它們之間有關聯或接觸。例如,雖然他們在同一辦公室工作,但他們之間的互動不多。如果你對兩個人這麼說,意思是他們都在那裡,但可能不常交談。我們經常使用這個詞來表示交談。所以,Ann Marie 和我會和你們互動,我們一起學習英語。
好,我們繼續閱讀,我們了解這個工具。文中說到,一個工具利用了所謂的視覺片段(visual patches),這些是幫助 AI 理解場景中每一個元素如何逐幀互動和發展的數據基礎建塊。這裡有一個定義:視覺片段是幫助 AI 理解場景中一切如何互動和逐幀發展的數據基礎建塊。
「基礎建塊」(building blocks)這個詞,我們知道它可以指玩具中的積木。很多孩子玩木積木,這是一種非常基本的玩具,但你可以用它們做出任何東西。在我們的文章中,「基礎建塊」的用法非常相似,指的是組成某物的基本單位。因此,我們不僅在談論數據時使用這個詞,我們在很多不同的情況下都會用到它。「基礎建塊」可以是任何東西,例如我們身體的基礎建塊就是細胞,或者語言的基礎建塊就是詞彙。你需要了解語法和詞彙,所以持續擴充你的英語詞彙基礎建塊,很快你就能夠說任何話。

separate fact from fiction

基於擴散系統的工具,如 OpenAI 的 Sora,經過訓練來識別物體,這樣它們就能通過過濾掉不正確的視覺片段來精煉圖像。這個過程可以比作一個工人多次熔煉原金,以去除所有缺陷。

最後,我們來看「filter」這個字。它既可以是動詞也可以是名詞。作為動詞時,其中一個意思是慢慢出現或滲入。例如,"The photographer filtered the light through a piece of light fabric to create a soft effect"(這位攝影師透過一塊輕薄的布料讓光線滲入,創造出柔和的效果)。此外,「filter」也可以指「過濾」或「過濾器」,例如 "You need to filter the water here before you drink it, as it is not pure"(這裡的水不乾淨,喝之前要先過濾一下),或是 "Walt replaced the old water filter to make sure his drinking water was clean and pure"(Walt 更換了舊的濾水器,以確保他的飲用水是乾淨的)。

接下來,讓我們來看「based on」這個片語。這是課程中標記為重點的片語之一,因為我們在對話中經常使用。當某事物是「based on」另一事物時,這意味著它是以另一事物為基礎或起點。比如說,一部電影可以是「based on」一本書。很多時候,書籍和電影之間的內容並不完全相似,但有時電影會從書中取部分故事,這樣電影就是「based on」那本書。
在這裡我們討論的是 OpenAI 的 Sora,這些工具經過訓練來識別物體,以便能夠精煉圖像。我們需要再次停下來,談談一些術語。文中提到「trained」這個詞,你能解釋一下嗎?當你說「我在學習如何烹飪」或「我在接受翻譯員的培訓」,這意味著你正在學習一組技能並進行實踐。因此,訓練是一個反覆進行的過程,不是一次性的事情,而是需要學習和教導的。
小狗在年幼時會接受訓練來學會良好的行為,而 AI 系統也需要訓練,以便學會如何變得越來越好。這樣它們就能識別物體,並精煉圖像。如果你「refine」某樣東西,意味著你使它變得更好或改進它。這是一個逐步的過程,「re-」表示「再次」,而「fine」則表示「改善」。因此,這是一個不斷訓練和改善圖像的過程。AI 將這些圖像多次處理,期望最終得到更好的圖像。
當我們說「原金」時,如果某樣東西是「raw」,意味著它還沒有經過處理或煮熟,你只是從地裡挖出來的。因此,有時原金中會包含其他金屬,你需要將它熔煉多次,直到剩下純金。這和處理圖像的過程很類似。AI 將圖像多次處理,希望最終能得到更好的效果。

大家好,我是偵探 Ernest Finder,今天我有一個有趣的事實告訴你們。你知道「Sora」在日語中是「天空」的意思嗎?這是真的,而「天空的極限」意味著無窮的可能性。有時我們說 AI 的發展沒有極限,但要小心。這就是今天的趣味事實。

OpenAI’s Sora is based ____ a diffusion system. Is it:
based on a diffusion system,
based in a diffusion system,
based with a diffusion system,
or based by a diffusion system?
哇,這題有點難。我想「based on」像是基礎或起點,所以這裡應該是「based on」,因為一切都是在這之後建立的。沒錯,正確的介系詞是「on」。
朋友們,我們還沒完結關於文字轉視頻 AI 的討論,這真的是一個非常有趣的話題。我知道我學到了很多,希望你們會回來和我們一起學習,就在 Studio Classroom。

The Magic of Text-to-Video AI, Part 2
Studio Classroom, September 7, 2024 (a ChatGPT-generated Chinese translation follows the transcript)
Hello, friends, and welcome to Studio Classroom!
My name is Ann Marie.
And my name is John.
Today is a great day to learn something new, so let’s do that together.
John, why are you making noise with that calculator over there?
Oh, I’m sorry. Yesterday we were talking about the new technology of generative AI, where you can type in a prompt, like a sentence, and the AI will give you a video. But you know what, Ann Marie? It made me miss old technology like the calculator.
You miss calculators? You have one on your phone.
I’m trying to get my calculator to give me a video of Will Smith eating spaghetti, but it’s just a bunch of 8s and 5s.
That calculator is not going to be able to take text prompts and turn them into video like generative AI can. We were learning about that yesterday: how generative AI tools are deep learning models that can generate high-quality text, images, and other content, and how some of the results are very, very bizarre, like videos of Will Smith eating spaghetti. These tools are learning more and more.
That’s right, because unlike my calculator, they have a lot of processing power. What they can do is use building blocks of data—pieces of data that they can put together and refine images, just making the pictures better and better. So, the technology is growing really fast. There is a lot to learn together and a lot of English to practice.
That’s right. So, friends, we are not done learning about this. Let’s get right into our lesson today: the magic of text-to-video.
One of the remarkable things about OpenAI’s Sora is that it can generate content in a variety of styles. When the project was disclosed last February, OpenAI claimed that Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.
Hi everyone, welcome to Language Lab. I’m Jack. First, let’s look at the verb "disclose," which means to make something known or to reveal it. For example:
The real estate agent had to disclose the house’s history of flooding to anyone who was thinking of buying it.
Peggy finally felt safe enough to disclose her deepest fears and concerns for the first time.
The government was under pressure to disclose the total amount of damage caused by the chemical spill.
Alright, thank you so much, Jack. Getting into our lesson here, we read: “One of the remarkable things about OpenAI’s Sora is that it can generate content in a variety of styles.” Let’s review that word "generate" from yesterday. Yes, to generate means to make or produce something. You could say the factory generates a lot of toys. "Generator" is the noun form, and it often refers to a machine that makes, or generates, electricity.
But let’s keep reading here. When the OpenAI Sora project was disclosed last February, OpenAI claimed that Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.
Okay, let’s break some of that down. First of all, we see here that OpenAI claimed that Sora can do this. Now, when you see this word “claimed,” the idea is that someone is saying something, and others don’t know if it’s true or not. It might be true, but it might not necessarily come to pass as described. OpenAI is saying they have a product that can do this, and now everyone is waiting to see if it actually can. They say these videos can be a minute long, which is quite long, and they can maintain visual quality.
Now, if you maintain something, it means you’re causing or enabling a condition or situation to continue. For example, you could say, “I’m really trying to maintain the paint on my new car,” meaning you’re trying to keep it looking good. So maintaining visual quality means keeping the video’s appearance good.
We learned yesterday that maintaining visual quality is hard. That video of Will Smith was bizarre—it was strange. So, how is it going to achieve this? Partly by adherence to the user’s prompt. The prompt is the text that you put into the AI, and adherence means sticking to something and not deviating from it. For example, “Jack showed adherence to the rules when he would not walk on the grass,” meaning he was sticking to the rules.
Now, we can also use "adhere" as a verb. You might see this in situations such as, “Students, for your homework assignment, you need to adhere to the topics I gave you,” meaning you can only use those topics. There’s another word that uses this “adh” root that means to stick to, and that’s the word "adhesive." It’s not covered in this lesson, but an adhesive is a glue that makes two things stick together.
So that’s just some fun English! Well, Ann Marie, it’s time to read more. There’s so much to learn about AI and the magic of text-to-video.
But there are constraints to Sora’s capabilities, as OpenAI shows in some demo videos. The physics of how objects interact in the videos doesn’t always make sense. Sometimes people or objects will blend together, transform into other things, disappear, or appear out of nowhere. But if the technology can evolve from the bizarre spaghetti video to where it is now in a year, these limitations probably won’t last long.
Next, let’s look at the noun "constraint," which means a restriction or limitation. For example:
Due to budget constraints, the team had to give up certain features to stay within their financial limits.
The time constraint on the exam made it challenging for many students to answer all of the questions.
Dropping the “t” at the end of “constraint” gives us the verb “constrain.” For example, “Their limited budget constrained the team’s ability to hire additional staff for the project.”
Alright, thank you, Jack. Let’s read on together, friends. But there are constraints to Sora’s capabilities. You saw that word “constraints” in your Language Lab. This sentence means that this technology is impressive, but it has its limitations. As OpenAI shows in some demo videos, the physics of how objects interact doesn’t always make sense.
Let’s take a look at a few things in this sentence. We see the word “demo,” which is short for “demonstration.” This usually shows a product or technique. We can use “demo” in other ways as well. Sometimes for a video game, the company will release a demo video game where you can play a level, but it’s not the whole game—just a little show. These demo videos have some problems. The physics of how objects interact doesn’t always make sense.
Physics is the way physical objects move and interact with each other. If something “makes sense,” it means it is understandable or logically sound. For example, “The instructions for how to build this furniture just don’t make sense. I can’t understand them.” And sometimes, you can use “makes sense” by itself. For instance, if Ann Marie said, “I’m going to put a new roof on my house because the old one kept leaking,” I could just say, “Makes sense,” meaning it’s a smart idea.
We use this in casual conversation often. When we read on, “Sometimes people or objects will blend together,” it means they become one thing, or maybe the colors bleed. They might transform into other things, disappear, or appear out of nowhere.
John, I’ve tested this a little bit before, or I’ve watched people test it. I’m not tech-savvy enough to use AI very well, but people have shown me things they’ve created, and I’m always surprised at how many extra hands or legs show up in these images.
It’s always hands!
Yes, it’s always hands because the physics of your hands are incredibly complicated. I mean, what a gift—your hands can move in so many ways. So AI doesn’t really understand the physics of hands well, and sometimes fingers will appear out of nowhere. “Out of nowhere” means to appear unexpectedly. For example, “Wow, my dog appears out of nowhere when I’m cooking food,” or “The plane appeared out of nowhere in the sky,” or “That storm came out of nowhere. We were sitting here enjoying ourselves, and suddenly it started to rain.”
Alright, but we read on, friends: “But if the technology can evolve from the bizarre spaghetti video to where it is now in a year, these limitations probably won’t last long.” AI has come a really long way even just in a year.
That’s right. When we say the limitations “won’t last long,” it means they are temporary and short-term. For example, “My new diet won’t last long” means it probably won’t continue for a long time.
Alright, friends, we’ll be right back after today’s Info Cloud.
Hello, friends. Welcome to Info Cloud.
Hey, Rex, what do you think about the slogan “seeing is believing”?
I have mixed feelings about it. For a lot of things, seeing is believing, which implies you believe something is true because you see it happen. That’s what many people think—seeing something is the way to find out if it is real or true. But I have to ask, does what you see always show you reality?
Oh, I see what you’re getting at. We see people who act very nicely in public, but in reality, they might be violent or mean, right? And with AI technology, you can produce a lot of false images that look real. So is seeing really believing?
Great point. There are many valuable things in life that we cannot see but do exist, such as love, courage, and faith. I have to say, sometimes believing is seeing.
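We often hear people say “seeing is believing,” which means that only what you see with your own eyes is real. In fact, many people hold this view deep down. If you tell them something happened, they usually won’t believe you right away; they have to see it before they accept that it is true. But many things are not what they appear to be on the surface. For example, many prominent figures show their best side in public but are quite different in private. AI technology can also produce very realistic images. “Seeing is believing” is true to some extent, but not absolutely, so we need to be especially careful. This is today’s Info Cloud. See you next time in the cloud.
"The Magic of Text-to-Video."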
Besides OpenAI’s Sora, Google has two products called Lumiere and VideoPoet. But like Sora, they may not be available for public use yet. These tools are still being rigorously tested and refined to ensure that people cannot use them to produce inappropriate content or break copyright laws. In the meantime, you can still experiment with simple text-to-video tools like InVideo AI, a product created with marketing, content creation, and education in mind. It’s perfect for creating online tutorials, explainer videos, or providing visuals for a short story. Scan the QR code to see what InVideo created using this article’s text.
Next, let’s look at the adverb “rigorously,” which means in a strict and thorough way. For example, “The new software was rigorously tested under various circumstances to ensure its reliability and performance.” Or, “Paulette rigorously followed her exercise routine, never missing a day at the gym.” If we drop the “-ly” at the end of “rigorously,” we get the adjective “rigorous.” For example, “Isaiah went through a rigorous interview process that tested both his knowledge and skills.”
Finally, let’s look at the noun “copyright,” which refers to the legal right to publish, copy, and sell a creative work. For example, “The copyright of the famous book expired, so it entered public domain.” Or, “Copyright laws protected the musician’s new song from others who wanted to record it.” “Copyright” can also be used as a verb, meaning to obtain the copyright for something. For example, “Before publishing her photographs in the magazine, the photographer made sure to copyright them.”
Alright, friends, let’s see how our article is going to end today. Besides OpenAI’s Sora, Google has two products called Lumiere and VideoPoet. But like Sora, they may not be available for public use yet. So, as of our date of filming, Sora and these other tools are not available for public use. Public use just means that anyone can buy and use it; it’s not restricted. These AI tools are still being rigorously tested and refined to ensure that people cannot use them to produce inappropriate content or break copyright laws. "Refined" means getting better and better, and "rigorous" means really careful and thorough.
We also see another word to discuss here: "inappropriate." We use this to describe something that is not suitable or proper in certain circumstances. For example, it would be inappropriate to wear sandals to a fancy restaurant. In our context, inappropriate content is that which is not suitable for anyone to watch, especially children. The opposite of inappropriate is "appropriate." We should always try to act and do things that are appropriate for everyone.
Now, moving on to the meantime. What can we do? In the meantime, you can still experiment with simple text-to-video tools like InVideo AI. "In the meantime" means during the time before something happens. For example, if something’s going to happen in the future, you might say, "My birthday is on Monday, but in the meantime, I’m getting ready and inviting my friends."
Sometimes my kids get sick, as I’m sure yours do too. And if it’s late at night, like 11 PM, and not very serious, I might say to my husband, "I’ll take her to the doctor in the morning, but in the meantime, let’s all get some sleep."
When you’re making future plans, you might say, "We have to do this, but in the meantime, let’s get ready." So, in the meantime, we can experiment with simple tools like InVideo. InVideo is a product created with marketing, content creation, and education in mind. It’s perfect for creating online video tutorials, which are explanations of how a task or job works, and providing visuals for a short story.
We have something special in our magazine. You can scan the QR code to see what InVideo created using this article’s text. It’s going to be very interesting for you to watch, so don’t miss that!
But in the meantime, let’s go to today’s fun fact.
Hello, friends! I’m Detective Ernest Finder, and I have a fun fact for you. Did you know that one of the first AI chatbots was made in 1964? That’s pretty crazy—such a long time ago! What did it do? It would ask questions. If you said, "I’m sad," it would ask, "Why are you sad?" That’s pretty cool and not a bad conversation, especially for a chatbot. That’s today’s fun fact!
Alright, friends, as we end our lesson today, let’s look at one of these "Talk About It" questions from your magazine: "What do you imagine AI will be able to do by this time next year?" What a great question! AI is refining all the time, so maybe by this time next year, we’ll have cars that can talk to us. Wouldn’t it be kind of cool if you went up to a parked car and it said, "Hey buddy, get away from me," or something like that? Or maybe it tells you how long it’s been in the parking spot, like, "I’m going to be here for a while longer. Keep moving."
Who knows? I’d be curious about what you think. How about this: use the English you know and write to us here at Studio Classroom Magazine to share your thoughts on AI. That’s a great idea and a great topic for you to discuss in English. Friends, we’ll see you next time right here on Studio Classroom!

你好,朋友們,歡迎來到 Studio Classroom! 我是 Ann Marie,我是 John。 今天是一個學習新知識的好日子,我們一起來學習吧!
John,你為什麼在那邊用計算機發出噪音呢? 哦,我很抱歉。昨天我們談到了生成式人工智慧的新技術,你可以輸入一個提示,比如一句話,然後人工智慧就會給你一個視頻。不過,你知道嗎,Ann Marie?這讓我懷念起像計算機這樣的舊技術。
你懷念計算機?你手機上有一個呢。 我在試圖讓我的計算機生成一個 Will Smith 吃意大利面的視頻,但它只顯示了一堆 8 和 5。那個計算機無法像生成式人工智慧那樣,根據文本提示生成視頻。我們昨天學習了這些——生成式人工智慧是深度學習模型,可以生成高質量的文本、圖像和內容,其中一些變得非常奇特,比如 Will Smith 吃意大利面的視頻。這些工具正在變得越來越智能。
關於 OpenAI 的 Sora,有一個值得注意的地方是它可以生成各種風格的內容。當這個項目在去年二月公開時,OpenAI 宣稱 Sora 能生成長達一分鐘的視頻,同時保持視覺質量和對用戶提示的遵循。
大家好,歡迎來到 Language Lab。我是 Jack。我們首先來看動詞 "disclose",意思是公佈或透露。例如:
Peggy 終於覺得安全,第一次透露了她最深的恐懼和擔憂。
好,謝謝你,Jack。接下來,我們來看我們的課程,讀到:“關於 OpenAI 的 Sora,有一個值得注意的地方是它可以生成各種風格的內容。”讓我們回顧一下昨天提到的單詞 "generate"。是的,"generate" 的意思是製造或產生某物。你可以說工廠生成了很多玩具。發電機(generator)也是名詞,指的是一種產生電力的機器。
但讓我們繼續讀下去。當 OpenAI Sora 項目在去年二月公開時,OpenAI 宣稱 Sora 能生成長達一分鐘的視頻,同時保持視覺質量和對用戶提示的遵循。
好的,讓我們拆解一下。首先,我們看到 OpenAI 宣稱 Sora 能做到這一點。當你看到 “claimed” 這個詞時,它的意思是某人說了某事,而其他人不知道它是否真實。它可能是真的,但也不一定會如描述般實現。OpenAI 說他們有一個能做到這些的產品,現在每個人都在等待看看它是否真的能做到。他們說這些視頻可以長達一分鐘,這已經很長了,並且能保持視覺質量。
如果你 “maintain” 某物,這意味著你讓某種狀況或情況持續下去。例如,你可以說:“我真的在努力維持我新車上的油漆,”意思是你在努力保持車輛外觀良好。所以,保持視覺質量意味著保持視頻的外觀良好。
我們昨天學到了,保持視覺質量很難。那個 Will Smith 的視頻很奇特——它很奇怪。那么,它是如何做到這一點的呢?部分是通過遵循用戶的提示。提示是你輸入到人工智慧中的文本,而 “adherence” 意味著堅持某事而不偏離。例如,“Jack 在不踩草坪的時候表現出對規則的遵循,”意味著他遵守了規則。
我們也可以把 “adhere” 用作動詞。例如,你可能會看到這樣的情境:“學生們,你們的作業需要遵循我給的主題,”意味著你只能使用這些主題。還有一個使用 “adh” 詞根的詞,表示粘附,那就是 “adhesive”。這在這一課中沒有涵蓋,但 “adhesive” 是一種使兩物粘在一起的膠水。
這些就是一些有趣的英語!好,Ann Marie,現在該閱讀更多內容了。關於人工智慧和文字轉視頻的魔力有太多東西要學習了。
但是 Sora 的能力也有一些限制,OpenAI 在一些演示視頻中顯示了這些限制。視頻中物體互動的物理現象有時不太合理。有時人或物體會融合在一起,變成其他東西,消失或突然出現。但如果技術能從奇特的意大利面視頻發展到現在的狀況,那麼這些限制可能不會持久。
接下來我們來看名詞 “constraint”。意思是約束或限制。例如:
“Constrained”,去掉結尾的 “t” 變成動詞形式。例如,“他們有限的預算限制了團隊為項目招聘額外人員的能力。”
好的,謝謝你,Jack。讓我們一起讀下去,朋友們。Sora 的能力有一些限制。你在 Language Lab 中看到了 “constraints” 這個詞。這句話的意思是這項技術很令人印象深刻,但它有其局限性。正如 OpenAI 在一些演示視頻中顯示的,物體互動的物理現象有時不太合理。
我們來看看這句話中的一些東西。我們看到 “demo” 這個詞,它是 “demonstration”(演示)的縮寫。這通常顯示一個產品或技術。我們也可以用 “demo” 來指其他方式。有時,遊戲公司會發佈一個演示版遊戲,你可以玩一個關卡,但不是整個遊戲——只是小小的展示。這些演示視頻存在一些問題。物體互動的物理現象有時不太合理。
物理學是物體如何運動和相互作用的方式。如果某事 “makes sense”,意味著它是可以理解的或邏輯上合理的。例如,“這些組裝家具的說明書就是不合理。我無法理解。”有時你可以單獨使用 “make sense”。例如,如果 Ann Marie 說:“我打算給房子換新屋頂,因為舊的經常漏水,”我可以直接說:“Make sense”,意思是這是一個明智的主意。
總是手! 是的,總是手,因為手的物理學非常複雜。我意思是,真是一個禮物——你的手可以以多種方式運動。因此,人工智慧對手的物理學理解得不好,有時手指會突然出現。“Out of nowhere” 意味著意外地出現。例如,“哇,我的狗在我做飯時突然出現,”或者“飛機在天空中突然出現,”或者“那場風暴突然來襲。我們坐在這裡享受時光,突然開始下雨。”
好了,朋友們,我們在今天的 Info Cloud 之後馬上回來。
大家好,歡迎來到 Info Cloud。 嘿,Rex,你怎麼看「眼見為憑」這個口號? 我對這個口號有點複雜的感覺。對很多事情來說,眼見為憑意味著你因為看到發生的事情而相信它是真的。很多人認為,看到某些事情就是了解它是否真實或正確的方式。但我得問,看到的東西是否總是能顯示現實? 哦,我明白你的意思了。我們看到有些人表面上非常友善,但實際上可能很暴力或兇惡,對吧?而且,利用 AI 技術,你可以製造出很多看起來真實的虛假影像。所以,眼見是否真的就是相信呢? 很好的觀點。生活中有很多寶貴的東西我們無法看到但它們確實存在,例如愛、勇氣和信仰。我得說,有時候,信仰就是看到。
我們經常聽到「眼見為憑」這句話,意思是親眼看到的才是真的。其實,很多人內心深處相信「眼見為憑」。你告訴他發生了一件事情,他通常不會先相信你,必須先看到才能相信那是真的。然而,很多事情並不像表面上看起來的那樣。例如,很多名人在公開場合表現出最好的一面,但私底下卻完全不同。AI 科技也能製造出非常逼真的影像。眼見為憑在某種程度上是正確的,但並不是絕對的。我們需要特別小心。這就是今天的 Info Cloud,我們下次雲端見。
除了 OpenAI 的 Sora,Google 還有兩款產品叫 Lumiere 和 Video Poet。但像 Sora 一樣,它們可能尚未對公眾開放使用。這些工具仍在嚴格測試和改進中,以確保人們無法用它們來製作不當內容或違反版權法。在此期間,你仍然可以嘗試一些簡單的文字轉影片工具,比如 InVideo AI,它是一個以營銷、內容創建和教育為目的的產品。它非常適合製作在線教程、解釋視頻或為短篇故事提供視覺效果。掃描 QR 碼看看 InVideo 如何利用本文的文字創作內容。
接下來看「rigorously」這個副詞,意思是「嚴謹地」、「嚴格地」。例如:The new software was rigorously tested under various circumstances to ensure its reliability and performance. 這款新軟體在各種情況下經過了嚴格的測試,以確保其可靠性和性能。或者:Paulette rigorously followed her exercise routine, never missing a day at the gym. Paulette 嚴格遵循她的運動計劃,從不缺席。如果我們去掉「rigorously」的尾部「ly」,變成形容詞「rigorous」,例如:Isaiah went through a rigorous interview process that tested both his knowledge and skills. Isaiah 經歷了一個嚴謹的面試過程,測試了他的知識和技能。
最後來看「copyright」這個名詞,意思是「版權」。例如:The copyright of the famous book expired, so it entered public domain. 這本名著的版權已過期,進入了公共領域。或者:Copyright laws protected the musician’s new song from others who wanted to record it. 版權法保護了這位音樂家的新歌,防止其他人錄製。Copyright 也可以作為動詞,意思是「取得版權」。例如:Before publishing her photographs in the magazine, the photographer made sure to copyright them. 在雜誌上發表她的照片之前,攝影師確保先取得這些照片的版權。
好了,朋友們,我們來看看今天的文章會如何結束。除了 OpenAI 的 Sora,Google 還有兩款產品叫 Lumiere 和 Video Poet。但像 Sora 一樣,它們可能尚未對公眾開放使用。因此,在我們拍攝的日期,Sora 和其他這些工具還未對公眾開放使用。公眾使用意味著任何人都可以購買和使用,而不受限制。這些 AI 工具仍在嚴格測試和改進中,以確保人們無法用它們來製作不當內容或違反版權法。「Refined」意味著變得越來越好,「rigorous」則指非常仔細和徹底的。
現在,來看看「in the meantime」吧。我們可以做些什麼呢?在此期間,你仍然可以嘗試一些簡單的文字轉影片工具,如 InVideo AI。「In the meantime」意思是在某件事情發生之前的時間。例如,如果將來有某件事要發生,你可能會說:「我的生日在週一,但在此期間,我在準備並邀請我的朋友。」
當你在制定未來計劃時,你可能會說:「我們需要做這個,但在此期間,我們先做好準備。」所以,在此期間,我們可以嘗試簡單的工具如 InVideo。InVideo 是一個以營銷、內容創建和教育為目的的產品。它非常適合製作在線視頻教程,即說明某個任務或工作的方式,以及為短篇故事提供視覺效果。
我們的雜誌中有一個特別的東西。你可以掃描 QR 碼來看看 InVideo 如何利用這篇文章的文字創作內容。這將會非常有趣,所以不要錯過!
大家好!我是偵探 Ernest Finder,今天有個有趣的事實要告訴你們。你知道嗎,最早的 AI 聊天機器人之一是在 1964 年製作的?這實在太令人驚訝了——那是很久以前的事了!它做了什麼?它會問問題。如果你說「我很傷心」,它會問「你為什麼傷心?」這非常有趣,尤其是對於一個聊天機器人來說,這樣的對話也不算差。這就是今天的趣味事實!
好了,朋友們,當我們結束今天的課程時,讓我們來看看雜誌中的「討論話題」問題之一:「你想像 AI 在明年這個時候能做到什麼?」真是個好問題!AI 正在不斷改進,也許到了明年這個時候,我們會有會跟我們對話的車子。如果你走到一輛停著的車前,它對你說「嘿,伙計,離我遠點」這樣會不會很酷?或者,它告訴你它在這個停車位上待了多久,例如「我還會在這裡待一會兒。繼續走吧。」
誰知道呢?我對你的想法很感興趣。不妨這樣做:用你知道的英文寫信給我們的 Studio Classroom Magazine,分享你對 AI 的看法。這是一個很好的主意,也是一個很好的英文討論話題。朋友們,我們下次在 Studio Classroom 再見!
