Embedding a Speech Transcript in a Flash Video with Adobe CS5
by Michael Hurwicz
This article accompanies a Creative COW video tutorial showing how to embed a speech transcript in a Flash video project using Adobe Creative Suite 5. You start by creating a "reference script" in Adobe Story, scriptwriting software that is free (for now, anyway), provided as an online service but also downloadable to run offline under Adobe AIR. Then you embed that script into the video using Adobe OnLocation. The speech analysis is performed in Premiere Pro. Finally, you use Adobe Soundbooth to ferry the speech transcription metadata into your Flash project.
This workflow evolved from an earlier workflow developed for CS4, which required fewer applications, but provided a less accurate speech transcription. (See below for details of this evolution.)
The workflow described in the video tutorial looks like this:
You can use the FLV file for your Flash Professional project.
It's true, this really shouldn't be called automatic transcription at all. More like automatic script-to-video synchronization or alignment. Whatever you call it, it's still a time-saving feature, because it allows you to get an excellent embedded transcript without having to use the horrible speech analysis editing interface in Premiere Pro or Soundbooth. Instead, you get to use the really quite nice editing interface in Adobe Story, which has all sorts of helpful features to make your job easier.
See the next section for details of creating the Flash Professional project. ActionScript code follows. Also consult the working example, cs5_speech_transcript.fla.
Creating the Flash Professional CS5 "Scrolling Transcript" Application
The basis of the Flash application is the FLVPlayback component, a video player that comes with the Flash authoring environment. The other main visual element is a TextArea component to hold the scrolling transcript.
There are three basic phases to creating the Flash project:
1. Create a new project.
2. Populate the Library.
3. Cut and paste the ActionScript code.
That’s it! Deploy your SWF (including SkinUnderPlaySeekMute.swf), HTML, and FLV files, and you should have a working application.
(Note: This application was tested successfully on Windows 7, 64-bit, on an HP Z600 Workstation with dual 2.93GHz processors and 12GB of RAM.)
// ActionScript code for Creative COW article
// (consult the working example, cs5_speech_transcript.fla)
import fl.video.*;
import fl.controls.TextArea;

// create the video player
var player:FLVPlayback = new FLVPlayback();
player.source = "cs5_speech_transcript.flv";
player.skin = "SkinUnderPlaySeekMute.swf";
addChild(player);

// create the TextArea that holds the scrolling transcript
var transcript:TextArea = new TextArea();
addChild(transcript);

// listen for cue points and metadata availability
player.addEventListener(MetadataEvent.METADATA_RECEIVED, metadataListener);
player.addEventListener(MetadataEvent.CUE_POINT, cuePointListener);

function metadataListener(event1:MetadataEvent):void {
    trace("metadata received"); // show that metadata is available (for debugging)
}

function cuePointListener(event1:MetadataEvent):void {
    // with CS5, the transcript text is nested inside event1.info,
    // so walk the cue point's parameters to unwrap it
    for (var key:String in event1.info.parameters) {
        // add new text to the bottom of the TextArea...
        transcript.appendText(event1.info.parameters[key] + " ");
    }
    // ...and keep the TextArea scrolled to the bottom as text appears
    transcript.verticalScrollPosition = transcript.maxVerticalScrollPosition;
}
Horrible, Powerful Speech Transcription
Both Adobe Premiere Pro and Adobe Soundbooth can do what Adobe now terms "speech analysis," automatically analyzing the spoken words in a video and creating a text transcript. (See my April 2009 article, "Creating Automatic Transcripts in Flash Video Using Adobe CS4.") There are two main weak points in that process: 1) the accuracy of the speech recognition and 2) the ease of editing the transcript.
For normal speech – even speech that sounds quite clear to you and me – the transcription (if you're not using the new workflow which I'm about to describe) is often laughably, divertingly bad. Less than 50% accuracy is quite typical. (The software tries its best, however, which is where the entertainment value comes in. For instance, in the video used for this article, Premiere Pro CS5 – with no reference script to guide it – put the following words in Tim Siglin's mouth: "Get sexy and in Ensenada with a friend that he cannot said terrible acts so there it seems he knows are thinking that yesterday I had to visit this tag and man the woman ..." Which may be true, but it's not what Tim said.)
As for the editing function (in Premiere Pro or Soundbooth), you edit one word at a time, and you have to double-click a word before you can edit it. (If you need to edit several consecutive words, you can tab from one to the next.) And to shift the timing of a word, you have to right-click it first and then select "Merge With Next Word" or "Merge With Previous Word" from a context menu. The same is true for deleting or copying a word. "Cumbersome" is too kind a word for it. It's horrible.
However, what you end up with is potentially powerful: an XML file that gives you text and timing for every spoken word in the video. This could be used, for instance, to create subtitles for a Flash video project (which is basically what I did with CS4 last year) or to create a searchable index that actually takes you to the point in the video where the search item occurs (which Adobe has done via Encore CS5, DVD / Blu-ray authoring software).
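To make that concrete, here is a simplified, purely illustrative sketch of such a per-word file; the element and attribute names are mine, not the actual XMP schema:

```xml
<!-- illustrative only: simplified element names, not the actual XMP schema -->
<transcript>
  <word startTime="1.20" duration="0.35">so</word>
  <word startTime="1.55" duration="0.40">streaming</word>
  <word startTime="1.95" duration="0.30">media</word>
</transcript>
```

With text and timing in hand for every word, building subtitles or a searchable index is a matter of iterating over these entries.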
In the Flash authoring environment (Flash Professional), your imagination and your ActionScripting skills are the only limits to what you can do with the embedded transcript. You could ring a bell every time the "magic word" for the day comes up, display text ads based on the topics being discussed in the video, display links to related information, or enforce parental controls based on the language used.
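As a hypothetical sketch of that kind of trigger (the word list and the playBell() helper are assumptions, not part of the article's project), a cue point handler could watch for keywords:

```actionscript
// hypothetical keyword trigger: ring a bell when the "magic word" is spoken
var magicWord:String = "salad";

function handleTranscriptWord(word:String):void {
    if (word.toLowerCase() == magicWord) {
        playBell(); // assumed helper that plays a bell Sound
    }
}
```

The same pattern extends to the other ideas: switch on the recognized word and display an ad, show a link, or enforce a parental control accordingly.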
A Cool, Boring, Typical Reference Script Scenario
It's this flexibility that fueled my interest in developing a methodology for getting the speech transcript into a form that could be accessed in the Flash authoring environment via ActionScript. Specifically, that means that each word in the speech transcript has to be associated with a "cue point," a marker in the Flash video (FLV) file which can be used for navigation or to trigger any event that can be programmed via ActionScript.
So, when I started investigating Adobe Creative Suite 5, though I was suitably impressed by the "big" new features for video, like increased performance, tapeless workflows, and sharing with FCP and Avid users, I was equally curious whether they had done anything to improve speech transcription accuracy and editing, or integration of transcription data into Flash. In addition, since the speech transcript is metadata embedded in the video file, I was interested to see how speech transcription might fit in with Adobe's new metadata-driven "script-to-screen" workflow.
It turns out that CS5 offers one significant new feature relating to automatic transcription: the "reference script," a script that Premiere Pro (or Soundbooth) can be guided by while doing the transcription. It is very much a "script-to-screen" feature.
One typical scenario for using the reference script is as follows:
When you play the Web DVD in the Flash DVD Player, the embedded text is searchable. So, if you're watching a cooking video, you type "salad" in the Flash DVD Player search box, and it takes you to the points in the video where that word is spoken. It's really pretty cool. But it's also boring, because it's just too easy.
Adobe has a video on this, if you're interested.
Getting the Same Old Result with More Work (Whoopee)
My interest was in a different workflow, starting with an existing video and ending in an ActionScript-based Flash Professional project. Initially, I thought that because I was starting with an existing video clip, I would have no need for OnLocation. I was very, very wrong. I also thought I would have no need for Encore, since my goal was to create a project in Flash Professional. That turned out to be correct.
As I mentioned above, for a Flash Professional project, you want to create a Flash video file (FLV) with embedded cue points that you can access via ActionScript. In the past, I have used the Adobe Media Encoder to embed cue points in FLVs, and then used Flash Professional to create an FLV player. With that approach, I can use ActionScript to determine what happens when a cue point is detected during playback.
An obstacle to this workflow with the typical "script-to-screen" workflow is that Flash Professional and the Adobe Media Encoder both want to work with cue points, while the "script-to-screen" workflow in other Adobe applications (Story, OnLocation, Premiere Pro, Encore and the Flash DVD Player) is based on data stored in the XMP (Extensible Metadata Platform) format. However, this is not a show-stopper: As I showed in the article mentioned above, Adobe Soundbooth can import embedded XMP speech transcription metadata and export an XML cue point file that the Adobe Media Encoder can use to embed cue points in an FLV. Those cue points can then be referenced via ActionScript in a player created with Flash Professional.
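For orientation, the cue point XML that the Adobe Media Encoder reads uses an FLVCoreCuePoints root with one CuePoint entry per marker; the values below are illustrative, not taken from this project:

```xml
<FLVCoreCuePoints>
  <CuePoint>
    <Time>1200</Time>          <!-- milliseconds from the start of the clip -->
    <Type>event</Type>
    <Name>so</Name>            <!-- illustrative: the transcribed word -->
    <Parameters>
      <Parameter>
        <Name>text</Name>      <!-- illustrative parameter name -->
        <Value>so</Value>
      </Parameter>
    </Parameters>
  </CuePoint>
</FLVCoreCuePoints>
```

Each Parameter becomes a property of the cue point's parameters object, which is what ActionScript sees at playback time.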
One thing that threw me for a minor loop was that the format of the XML cue point file created by Soundbooth has changed from CS4 to CS5: it's gotten more complex. Reflecting that, the cue points themselves are more complex, and the ActionScript that looks at them has to be a bit more sophisticated. In my ActionScript code, the cue point metadata object is called event1.info, and the code for accessing the transcription text in CS4 was simply event1.info.name. Couldn't be much simpler. With CS5, the text is buried a couple of levels down in the event1.info object, and the ActionScript code has to unwrap those layers to get at the text. (See the cuePointListener function in the ActionScript accompanying this article to see how that's done.)
By overcoming this complication, I was able to duplicate last year's results. (Whoopee.)
Bring in the Reference Script, Please!
The workflow I developed for CS4 does still work with CS5 (with the minor ActionScript change described in the previous section), and I suppose it could still be a useful tool, even though, with the video I used for this article (the first minute or so of Tim Siglin's "Red Carpet" interview from Streaming Media East 2010), I hit only 41% accuracy using last year's approach.
However, by bringing in a reference script, things started to get a little more interesting. I had been told by a generally reliable informant (Karl Soulé, Adobe Technical Evangelist), that even a plain text file with some key phrases from the video would be used by the speech analysis software to decode those phrases. That sounded pretty cool: No need to make a complete transcript; any text file that happens to have a bunch of the same phrases used in the video could be helpful. Of course, how helpful it is depends on the particular video and the particular text file.
I created a 54-word text file which I thought addressed the areas the speech analysis software had the most trouble with. For my efforts, I got a mere four-point improvement, from 41% accuracy to 45%. This was disappointing: the whole video contains only about 220 words, and the number of words correctly identified increased by a mere five, from 87 to 92. Not great for a 54-word reference "script".
So, I tried creating a complete transcription of the video in Adobe Story. That yielded 82% accuracy. I was somewhat pleased and somewhat puzzled. I had doubled my accuracy. But why didn't I get at least 95% accuracy from a 95% accurate script? (The 5% inaccuracy, by the way, is because I didn't transcribe "ums" or dysfluencies like repetitions of words. This made for a more readable transcription but was slightly less faithful to the actual spoken word.)
There were various reasons, I think, for this imperfect performance. In a couple of places, both people were talking at the same time. While the human brain handles this easily enough, speech transcription software doesn't. In other places, the software tried to make sense of a non-meaningful sound, like an "um". Then there were the mysterious failures, where, even with the precise words available to it, and no obvious complicating factors, the software failed to recognize certain phrases. For instance, near the end, Tim says "the Z18 or the ZX3". Even after being told, via the Story script, to expect this, the software translated it as "to see you one day for this the X ray". There were several other surprising deficits in the speech analysis, including loss of most of the punctuation. See the second page of this article for details.
Anyway, the 82%-accurate workflow in CS5 looked like this:
(There's a link at the end of this article to step-by-step instructions for creating the Flash application.) Again, this is basically the same workflow I described last year for CS4, with the addition of a reference script.
In a way, of course, this approach fails to deliver on the promise of automatic speech transcription: You have to do the initial transcribing manually. But, let's face it, the unaided software really isn't very good at recognizing words anyway. On the other hand, it is excellent at determining the timing of words once they've been recognized (a function sometimes referred to as "script alignment" as opposed to speech recognition). This approach uses the software for what it's good at, and the human brain for what it's good at. This also could be the approach of choice if you already have good transcripts of your videos (they don't have to be Adobe Story scripts) and you just want to embed them in the videos. The accuracy won't be perfect. You will lose a lot, in my experience. But it's not much effort, either, if you already have the transcripts.
(That being said, there are hybrid approaches in which you let the software take a first stab at decoding the words, either with no reference script or with a quick-and-dirty reference script, and then use the results of that decoding to create a new improved reference script, after which you follow the same workflow shown above. Given that the software can often hit 30% - 50% accuracy with no help at all, this approach could save you a lot of typing when creating an initial transcript. The minimal reference script, in which you give the software just a few difficult key phrases, is an intriguing approach to me. Though, as I have said, it didn't really help me much for the particular clip I was working with.)
When I first saw a "perfect" transcription (as perfect as my reference script, anyway) appear as embedded metadata in Premiere Pro, I was thrilled. I assumed that this perfection would transfer, via Soundbooth and the Adobe Media Encoder, into the cue points of my FLV.
Not true. When I tried to use the XML file from Soundbooth in the Media Encoder, it worked very imperfectly, if at all. I pored over the XML, but couldn't see anything wrong with it (or anything significantly different from earlier, pre-OnLocation XML files, which had worked). I tried several other things, which I will spare you, some of which involved contact between my head and the wall.
So, still not satisfied with 82% accuracy from a 95% accurate script, I re-examined my assumptions, scanned some forums, Googled "what am I doing wrong?" and even (gulp) read some help files. What I discovered (thank you Curt Wrigley, Community Professional on Adobe Forums) was that by bringing OnLocation into the act, I was able to make the embedded transcription exactly match the Adobe Story script.
In addition, I discovered that I got consistently good results using Soundbooth, rather than the Media Encoder, for the final process of creating an FLV with embedded cue points. Why this works better, I don't know. But I am gladdened by the fact that it does.