Embedding a Speech Transcript in a Flash Video with Adobe CS5


www.hurwicz.com
Eastsound Washington USA
CreativeCOW.net. All rights reserved.


In this tutorial, Creative COW Leader Michael Hurwicz shows you how to create a scrolling transcript synced with a Flash video, using Adobe Creative Suite 5 (Story, OnLocation, Premiere Pro, Soundbooth and Flash Professional).





Embedding Speech Transcription into Flash Video with Adobe Creative Suite 5

by Michael Hurwicz

This article accompanies a Creative COW video tutorial showing how to embed a speech transcript in a Flash video project using Adobe Creative Suite 5. You start by writing a "reference script" in Adobe Story, scriptwriting software that is free (for now, anyway) and is provided as an online service but can also be downloaded to run offline under Adobe AIR. Then you embed that script into the video using Adobe OnLocation. The speech analysis is performed in Premiere Pro. Finally, you use Adobe Soundbooth to ferry the speech transcription metadata into your Flash project.

This workflow evolved from an earlier workflow developed for CS4, which required fewer applications, but provided a less accurate speech transcription. (See below for details of this evolution.)

Tim Siglin

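The workflow described in the video tutorial looks like this, in outline:

1.  Write the reference script in Adobe Story.

2.  Embed the script in the video using Adobe OnLocation.

3.  Run the speech analysis in Premiere Pro.

4.  Use Adobe Soundbooth to create an FLV file with embedded cue points.

You can use the FLV file for your Flash Professional project.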

It's true, this really shouldn't be called automatic transcription at all. More like automatic script-to-video synchronization or alignment. Whatever you call it, it's still a time-saving feature, because it allows you to get an excellent embedded transcript without having to use the horrible speech analysis editing interface in Premiere Pro or Soundbooth. Instead, you get to use the really quite nice editing interface in Adobe Story, which has all sorts of helpful features to make your job easier.

See the next section for details of creating the Flash Professional project. ActionScript code follows. Also consult the working example, cs5_speech_transcript.fla.

=========================================================================

Creating the Flash Professional CS5 "Scrolling Transcript" Application

The basis of the Flash application is the FLVPlayback component, a video player that comes with the Flash authoring environment. The other main visual element is a TextArea component to hold the scrolling transcript.

There are four basic phases to creating the Flash project:


1.  Create a new project.

2.  Populate the Library.

3.  Cut and paste the ActionScript code.

4.  Publish.

That’s it! Deploy your SWF (including SkinUnderPlaySeekMute.swf), HTML, and FLV files, and you should have a working application.

(Note: This application was tested successfully on Windows 7 64-bit, on an HP Z600 Workstation with 12GB of RAM.)

=========================================================================

// ActionScript code for Creative COW article:
// Embedding Speech Transcription into Flash Video with Adobe Creative Suite 5
// by Michael Hurwicz
import fl.video.*;
import fl.controls.*;
import flash.text.TextFormat;
import flash.utils.setTimeout;

// keep the TextArea scrolled to the bottom as text appears
// (the scroll runs after a 100 ms delay)
function autoscroll():void
{
    setTimeout(function():void { ta.verticalScrollPosition = ta.maxVerticalScrollPosition; }, 100);
}

// add new text to the bottom of the TextArea and scroll
function cuePointListener(event1:MetadataEvent):void
{
    trace("!!!"); // debugging marker: a cue point was reached
    //trace("Elapsed time in seconds: " + flvPlayer.playheadTime);
    // with CS4 the text was in event1.info.name
    // things are more complex with CS5: the text sits two levels down,
    // under a key named "data"

    for (var key:String in event1.info)
    {
        trace(key + ": " + event1.info[key]);
        for (var key2:String in event1.info[key])
        {
            trace(key2 + ": " + event1.info[key][key2]);
            if (key2 == "data")
            {
                trace("data found");
                ta.text += (" " + event1.info[key][key2]);
                autoscroll();
            }
        }
    }
}

// show that metadata is available (for debugging)
function receivedListener(event1:MetadataEvent):void
{
    trace("### metadata available ###");
}

// create the video player
var flvPlayer:FLVPlayback = new FLVPlayback();
flvPlayer.width = 656;
flvPlayer.height = 480;
flvPlayer.bufferTime = 60;
// flvPlayer.fullScreenTakeOver = false;
// The next line assumes you have copied the skin file to the local directory
flvPlayer.skin = "SkinUnderPlaySeekMute.swf";
// *** the following line is the one you want to change to your own FLV file *** //
flvPlayer.source = "tim_siglin_red_carpet.flv";
addChild(flvPlayer);

// listen for cue points and metadata availability
flvPlayer.addEventListener(MetadataEvent.CUE_POINT, cuePointListener);
flvPlayer.addEventListener(MetadataEvent.METADATA_RECEIVED, receivedListener);

// create the TextArea and format the text
var textAreaFormat:TextFormat = new TextFormat();
textAreaFormat.size = 18;
textAreaFormat.italic = false;
var ta:TextArea = new TextArea();
ta.x = 650;
ta.y = 10;
ta.width = 150;
ta.height = 300;
ta.setStyle("textFormat", textAreaFormat);
ta.text = "";
addChild(ta);

=========================================================================

Background/History

Horrible, Powerful Speech Transcription

Both Adobe Premiere Pro and Adobe Soundbooth can do what Adobe now terms "speech analysis," automatically analyzing the spoken words in a video and creating a text transcript. (See my April, 2009, article, "Creating Automatic Transcripts in Flash Video Using Adobe CS4.") There are two main weak points in that process: 1) the accuracy of the speech recognition and 2) the ease of editing the transcript.

For normal speech – even speech that sounds quite clear to you and me – the transcription (if you're not using the new workflow which I'm about to describe) is often laughably, divertingly bad. Less than 50% accuracy is quite typical. (The software tries its best, however, which is where the entertainment value comes in. For instance, in the video used for this article, Premiere Pro CS5 – with no reference script to guide it – put the following words in Tim Siglin's mouth: "Get sexy and in Ensenada with a friend that he cannot said terrible acts so there it seems he knows are thinking that yesterday I had to visit this tag and man the woman ..." Which may be true, but it's not what Tim said.)

As for the editing function (in Premiere Pro or Soundbooth), you edit one word at a time, and you have to double-click a word before you can edit it. (If you need to edit several consecutive words, you can tab from one to the next.) And to shift the timing of a word, you have to right-click it first and then select "Merge With Next Word" or "Merge With Previous Word" from a context menu. The same is true for deleting or copying a word. "Cumbersome" is too kind a word for it. It's horrible.

However, what you end up with is potentially powerful: an XML file that gives you text and timing for every spoken word in the video. This could be used, for instance, to create subtitles for a Flash video project (which is basically what I did with CS4 last year) or to create a searchable index that actually takes you to the point in the video where the search item occurs (which Adobe has done via Encore CS5, DVD / Blu-ray authoring software).
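As a rough illustration of the searchable-index idea, here is a minimal ActionScript 3 sketch. It assumes the per-word text and timing have already been read into an array of objects. The word and time field names, the buildIndex function, and the sample times are all made up for the example; they are not taken from the Soundbooth XML format.

```actionscript
// Hypothetical sample data: one entry per spoken word, with a made-up
// start time in seconds.
var transcript:Array = [
    { word: "There",     time: 11.8 },
    { word: "was",       time: 12.1 },
    { word: "a",         time: 12.3 },
    { word: "keynote",   time: 12.5 },
    { word: "yesterday", time: 13.0 },
    { word: "Keynote",   time: 14.1 }
];

// Build a lookup table from each lowercased word to the list of times
// at which it is spoken.
function buildIndex(entries:Array):Object
{
    var wordIndex:Object = {};
    for each (var entry:Object in entries)
    {
        var key:String = String(entry.word).toLowerCase();
        if (!wordIndex.hasOwnProperty(key))
        {
            wordIndex[key] = [];
        }
        wordIndex[key].push(entry.time);
    }
    return wordIndex;
}

var wordIndex:Object = buildIndex(transcript);
trace(wordIndex["keynote"]); // 12.5,14.1
```

In a player, a time looked up this way could be passed to the FLVPlayback component's seek() method to jump to that point in the video.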

In the Flash authoring environment (Flash Professional), your imagination and your ActionScripting skills are the only limits to what you can do with the embedded transcript. You could ring a bell every time the "magic word" for the day comes up, display text ads based on the topics being discussed in the video, display links to related information, or enforce parental controls based on the language used.
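The "magic word" idea could be sketched along these lines. This is only an illustration: the watch list and the isMagicWord function are hypothetical, and the sketch assumes each cue point delivers one transcript word as a plain string.

```actionscript
// Hypothetical watch list; the words are placeholders.
var magicWords:Array = ["metadata", "keynote"];

// Returns true when a transcript word is on the watch list, ignoring
// case and any leading or trailing punctuation.
function isMagicWord(word:String):Boolean
{
    var cleaned:String = word.toLowerCase().replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, "");
    return magicWords.indexOf(cleaned) != -1;
}

trace(isMagicWord("Keynote,"));  // true
trace(isMagicWord("yesterday")); // false
```

A cue point listener could call a check like this on each incoming word and then play a sound, show a link, or do whatever else the project calls for.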

A Cool, Boring, Typical Reference Script Scenario

It's this flexibility that fueled my interest in developing a methodology for getting the speech transcript into a form that could be accessed in the Flash authoring environment via ActionScript. Specifically, that means that each word in the speech transcript has to be associated with a "cue point," a marker in the Flash video (FLV) file which can be used for navigation or to trigger any event that can be programmed via ActionScript.

So, when I started investigating Adobe Creative Suite 5, though I was suitably impressed by the "big" new features for video, like increased performance, tapeless workflows, and sharing with FCP and Avid users, I was equally curious whether they had done anything to improve speech transcription accuracy and editing, or integration of transcription data into Flash. In addition, since the speech transcript is metadata embedded in the video file, I was interested to see how speech transcription might fit in with Adobe's new metadata-driven "script-to-screen" workflow.

It turns out that CS5 offers one significant new feature relating to automatic transcription: the "reference script," a script that Premiere Pro (or Soundbooth) can be guided by while doing the transcription. It is very much a "script-to-screen" feature.

One typical scenario for using the reference script is as follows:

When you play the Web DVD in the Flash DVD Player, the embedded text is searchable. So, if you're watching a cooking video, you type "salad" in the Flash DVD Player search box, and it takes you to the points in the video where that word is spoken. It's really pretty cool. But it's also boring, because it's just too easy.

Adobe has a video on this, if you're interested.

Getting the Same Old Result with More Work (Whoopee)

My interest was in a different workflow, starting with an existing video and ending in an ActionScript-based Flash Professional project. Initially, I thought that because I was starting with an existing video clip, I would have no need for OnLocation. I was very, very wrong. I also thought I would have no need for Encore, since my goal was to create a project in Flash Professional. That turned out to be correct.

As I mentioned above, for a Flash Professional project, you want to create a Flash video file (FLV) with embedded cue points that you can access via ActionScript. In the past, I have used the Adobe Media Encoder to embed cue points in FLVs, and then used Flash Professional to create an FLV player. With that approach, I can use ActionScript to determine what happens when a cue point is detected during playback.

An obstacle to combining this approach with the typical "script-to-screen" workflow is that Flash Professional and the Adobe Media Encoder both want to work with cue points, while the "script-to-screen" workflow in other Adobe applications (Story, OnLocation, Premiere Pro, Encore and the Flash DVD Player) is based on data stored in the XMP (Extensible Metadata Platform) format. However, this is not a show-stopper: As I showed in the article mentioned above, Adobe Soundbooth can import embedded XMP speech transcription metadata and export an XML cue point file that the Adobe Media Encoder can use to embed cue points in an FLV. Those cue points can then be referenced via ActionScript in a player created with Flash Professional.

One thing that threw me for a minor loop was that the format of the XML cue point file created by Soundbooth has changed from CS4 to CS5: it's gotten more complex. Reflecting that, the cue points themselves are more complex, and the ActionScript that looks at them has to be a bit more sophisticated. In my ActionScript code, the cue point metadata object is called event1.info, and the code for accessing the transcription text in CS4 was simply event1.info.name. Couldn't be much simpler. With CS5, the text is buried a couple of levels down in the event1.info object, and the ActionScript code has to unwrap those layers to get at the text. (See the cuePointListener function in the ActionScript accompanying this article to see how that's done.)
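If you would rather not hard-code how many levels down the text sits, one alternative is a small recursive search for the "data" key. The sketch below is a generalization, not the code used in the project: the collectData function is hypothetical, and sampleInfo is only a stand-in object, since the exact layout of a CS5 cue point's info object is not spelled out here.

```actionscript
// Hypothetical helper: collect every value stored under a "data" key,
// however deeply it is nested inside the given object.
function collectData(node:Object, found:Array = null):Array
{
    if (found == null)
    {
        found = [];
    }
    for (var key:String in node)
    {
        var value:Object = node[key];
        if (key == "data")
        {
            found.push(String(value));
        }
        else if (value != null && typeof value == "object")
        {
            collectData(value, found);
        }
    }
    return found;
}

// Stand-in for a cue point's info object; the "payload" key is made up
// and the real layout may differ.
var sampleInfo:Object = {
    name: "cue1",
    time: 1.25,
    payload: { data: "thanks" }
};
trace(collectData(sampleInfo).join(" ")); // thanks
```

Inside a cue point listener, the same call could be made on event1.info, with the returned words appended to the TextArea.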

By overcoming this complication, I was able to duplicate last year's results. (Whoopee.)

Bring in the Reference Script, Please!

The workflow I developed for CS4 does still work with CS5 (with a minor ActionScript code change described in the previous section), and it could still be a useful tool, I guess, even though I only hit 41% accuracy using last year's approach with the video I was testing for this article (the first minute or so of Tim Siglin's "Red Carpet" interview from Streaming Media East 2010).

However, by bringing in a reference script, things started to get a little more interesting. I had been told by a generally reliable informant (Karl Soulé, Adobe Technical Evangelist), that even a plain text file with some key phrases from the video would be used by the speech analysis software to decode those phrases. That sounded pretty cool: No need to make a complete transcript; any text file that happens to have a bunch of the same phrases used in the video could be helpful. Of course, how helpful it is depends on the particular video and the particular text file.

I created a 54-word text file which I thought addressed the areas that the speech analysis software had the most trouble with. For my efforts, I got a mere four-percentage-point improvement, from 41% accuracy to 45%. This was disappointing, since the whole video has only about 220 words in it, and the number of words correctly identified increased by a mere five words, from 87 to 92. Not great for a 54-word reference "script".

So, I tried creating a complete transcription of the video in Adobe Story. That yielded 82% accuracy. I was somewhat pleased and somewhat puzzled. I had doubled my accuracy. But why didn't I get at least 95% accuracy from a 95% accurate script? (The 5% inaccuracy, by the way, is because I didn't transcribe "ums" or dysfluencies like repetitions of words. This made for a more readable transcription but was slightly less faithful to the actual spoken word.)

There were various reasons, I think, for this imperfect performance. In a couple of places, both people were talking at the same time. While the human brain handles this easily enough, speech transcription software doesn't. In other places, the software tried to make sense of a non-meaningful sound, like an "um". Then there were the mysterious failures, where, even with the precise words available to it, and no obvious complicating factors, the software failed to recognize certain phrases. For instance, near the end, Tim says "the Z18 or the ZX3". Even after being told, via the Story script, to expect this, the software translated it as "to see you one day for this the X ray". There were several other surprising deficits in the speech analysis, including loss of most of the punctuation. See the second page of this article for details.

Anyway, the 82%-accurate workflow in CS5 looked like this:

(There's a link at the end of this article to step-by-step instructions for creating the Flash application.) Again, this is basically the same workflow I described last year for CS4, with the addition of a reference script.

In a way, of course, this approach fails to deliver on the promise of automatic speech transcription: You have to do the initial transcribing manually. But, let's face it, the unaided software really isn't very good at recognizing words anyway. On the other hand, it is excellent at determining the timing of words once they've been recognized (a function sometimes referred to as "script alignment" as opposed to speech recognition). This approach uses the software for what it's good at, and the human brain for what it's good at. This also could be the approach of choice if you already have good transcripts of your videos (they don't have to be Adobe Story scripts) and you just want to embed them in the videos. The accuracy won't be perfect. You will lose a lot, in my experience. But it's not much effort, either, if you already have the transcripts.

(That being said, there are hybrid approaches in which you let the software take a first stab at decoding the words, either with no reference script or with a quick-and-dirty reference script, and then use the results of that decoding to create a new improved reference script, after which you follow the same workflow shown above. Given that the software can often hit 30% - 50% accuracy with no help at all, this approach could save you a lot of typing when creating an initial transcript. The minimal reference script, in which you give the software just a few difficult key phrases, is an intriguing approach to me. Though, as I have said, it didn't really help me much for the particular clip I was working with.)

When I first saw a "perfect" transcription (as perfect as my reference script, anyway) appear as embedded metadata in Premiere Pro, I was thrilled. I assumed that this perfection would transfer, via Soundbooth and the Adobe Media Encoder, into the cue points of my FLV.

Not true. When I tried to use the XML file from Soundbooth in the Media Encoder, it worked very imperfectly, if at all. I pored over the XML, but couldn't see anything wrong with it (or anything significantly different from earlier, pre-OnLocation XML files, which had worked). I tried several other things, which I will spare you, some of which involved contact between my head and the wall.

So, still not satisfied with 82% accuracy from a 95% accurate script, I re-examined my assumptions, scanned some forums, Googled "what am I doing wrong?" and even (gulp) read some help files. What I discovered (thank you Curt Wrigley, Community Professional on Adobe Forums) was that by bringing OnLocation into the act, I was able to make the embedded transcription exactly match the Adobe Story script.

In addition, I discovered that I got consistently good results using Soundbooth, rather than the Media Encoder, for the final process of creating an FLV with embedded cue points. Why this works better, I don't know. But I am gladdened by the fact that it does.

 

Testing environment:

HP Z600 Workstation, 12GB RAM, Dual Processor 2.93GHz
NVIDIA Quadro FX 3800 Graphics Adapter
Windows 7 64-bit

 

 

 





This Page Provides Detailed Comparison of Adobe CS5 Speech to Text Using:

1) No Reference Script

2) A Bare Bones Plain Text Reference Script

3) A Complete Adobe Story Reference Script, but Not Embedded in the Video File

An example using an embedded Adobe Story script (an approach that uses OnLocation as an intermediary between Story and Premiere Pro) is not included here, because it was 100% faithful to the script, a state of affairs which can be easily communicated without a three-column table.

All speech analysis was performed at the "High (Slower)" setting in the Premiere Pro "Analyze Content" dialogue.

Note that the "Comments" column shows correctly recognized words in 1) and 2) but mistakes in 3).

1) No Reference Script

Columns (in the order shown below): Speech Analysis | Reference Script | Comments

[Speaker 0] thanks for joining the flu after the fairer sex in the debate on the spot I just couldn't wait to try to

[Speaker 1] get sexy and in Ensenada with a friend that he cannot said terrible acts so there it seems he knows are thinking that yesterday I had to visit the tag and man the woman dubbed stay running out of the recapture

[Speaker 2] of England's so just briefly from Jeff yesterday just a big guy is the South Dakota neighbours lights and desperate personnel who likes to ride horses those of a wrench the government but then what was most interesting to me especially where I live now in Kingsport Tennessee where we have the other half of these days the chemical was to really see him talk about how to take an organization that was fairly staid and traditional and turn it into something that was for use as a potential big banana one of things I thought they really get on well with us they have these they have these products that are pretty good in terms of these video cameras are putting out but they have branding names like to see you one day for this the extra re doing so

Reference Script: NA

 

Correct transcriptions:

thanks for joining

sexy

yesterday

Jeff yesterday

South Dakota

likes to ride horses

but then what was most interesting to me especially where I live now in Kingsport Tennessee where we have the other half of

was to really see him talk about how to take an organization that was fairly staid and traditional and turn it into something that was

they have these products that are pretty good in terms of these video cameras are putting out but they have branding names like

 

good words / total words = 87/210

success = 41%

2) Reference Script, bare bones, text format (not Adobe Story)

Columns (in the order shown below): Speech Analysis | Reference Script | Comments

[Speaker 0] thanks for joining me absolutely after the furor sexy metadata taught us Islanders can wait but I

[Speaker 1] had sexy and in Ensenada with a friend that he cannot said triple axe so there it seems the answer keynote yesterday I to visit Kodak and man U woman got staid running out of the recapture

[Speaker 2] of England's so just briefly from jeff yesterday jeff's a big guy is the South Dakota neighbours likes to dance big personnel they likes to ride horses those of a wrench the government but to what was most interesting to me especially where I live now in Kingsport Tennessee where we have the other half of the Eastman Eastman chemical was to really see him talk about how to take an organization that was fairly staid and traditional and turn it into some bling the wounds of war is as potential to be dynamic one of things that fall to be really good on well with us they have these have these products that are pretty good in terms of these video cameras are putting out but they had branding names like to see you one day for this the next for a dentist

thanks for joining me
absolutely
sexy metadata
g to pg and up to triple x
keynote
jeff
woman from yahoo
jeff's a big guy
likes to ride horses
where we have the other half of the Eastman Kodak
staid
potential to be dynamic
video cameras
branding names like the Z18 or the ZX3

Correct transcriptions:

thanks for joining me absolutely

sexy metadata

keynote yesterday

sexy

jeff's a big guy

South Dakota

likes to ride horses

what was most interesting to me especially where I live now in Kingsport Tennessee where we have the other half of the Eastman Eastman chemical was to really see him talk about how to take an organization that was fairly staid and traditional and turn it into

potential to be dynamic

have these products that are pretty good in terms of these video cameras

putting out but they had branding names like

 

good words / total words = 92/204

success = 45%

 

 

3) Reference Script, Adobe Story ASTX format, "Script Text Matches Recorded Dialogue" checked in the Premiere Pro "Import Script" Dialogue

Columns (in the order shown below, repeated for each speaker): Speech Analysis | Reference Script | Comments
[Speaker 0] Thanks for joining me Oh absolutely After After our sexy metadata talk last time I just couldn't wait to try to get

PETER CERVIERI

Thanks for joining me.

TIM SIGLIN

Oh, absolutely! After our sexy metadata talk last time, I just couldn't wait to come back.

Mistakes:

try to get

[Speaker 1] sexy and in Ensenada quickly from G and he cannot said triple axe So there it seems keynotes there is a keynote yesterday I had to me is that the tag and then the woman got staid want to give kind of a recap

PETER CERVIERI

We'll get sexy again. It's going to go quickly from G to PG and up to triple X. So, there were two keynotes. There was a keynote yesterday: Jeff Hayzlett from Kodak. And then the woman from Yahoo today. You want to give kind of a recap of the take-aways?

Mistakes:

to PG and up to triple X.

were two

was

Jeff Hayzlett from Kodak.

from Yahoo

got staid

of the take-aways?

[Speaker 2]

Sure England's So just briefly Jeff yesterday Jeff's a big guy is a South Dakotan Beavers likes to dance big personality likes to ride horses has a ranch that kind of thing but I think what was most interesting to me especially where I live now in Kingsport Tennessee where we have the other half of the Eastman Eastman Chemical was to really see him talk about how to take an organization that was fairly staid and traditional and turn it into something that was the or is that has the potential to be dynamic One of things I thought he really hit on well was they have these they have these products that are pretty good in terms of these video cameras they are putting out but they had branding names like to see you one day for this the X ray and he said

TIM SIGLIN

Sure. So just briefly. Jeff, yesterday. Jeff's a big guy. He's a South Dakotan.

PETER CERVIERI

Big personality.

TIM SIGLIN

He has a big personality, likes to ride horses, has a ranch, that kind of thing. But I think what was most interesting to, me especially where I live now in Kingsport, Tennessee, where we have the other half of the Eastman, Eastman Chemical, was to really see him talk about how to take an organization that was fairly staid and traditional and turn it into something that has the potential to be dynamic. One of the things that I thought he really hit on well was, they have these products that are pretty good in terms of these video cameras they are putting out, but they had branding names like the Z18 or the ZX3, and he said ...

Mistakes:

England's

He's

Big personality.

dance

the or is

to see you one day for this the X ray

mistakes / total = 39/218 = 18%

success = 82%
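For anyone who wants to check the arithmetic, the percentages in the three comparisons can be reproduced from the word counts given above. The percent function is just an illustrative helper that rounds to the nearest whole number.

```actionscript
// Whole-number percentage of part out of total, rounded to nearest.
function percent(part:int, total:int):int
{
    return int(Math.round(100 * part / total));
}

trace(percent(87, 210));       // 41  (no reference script)
trace(percent(92, 204));       // 45  (bare-bones text reference script)
trace(100 - percent(39, 218)); // 82  (Adobe Story script, 39 mistakes)
```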

Some surprising glitches, even in example 3):

  • misses the first character change even though it has the exact wording
  • never gets character names
  • misses the following even though it was given the exact wording and there was no other particular interference (such as two people talking at the same time or a broken speech pattern):
    • "g to pg and up to triple x"
    • "woman from yahoo"
    • "the Z18 or the ZX3"

The third item in the above list is also somewhat surprising for example 2) above, since it was provided in the bare bones reference script.
