Wednesday, April 09, 2008

Transcribing speech utterances is a highly repetitive task, usually performed by a pool of people who are good at typing.

Out of the box, the Speech Server transcription tools are accessible through Visual Studio, and they are not well suited to transcribing utterances in any volume.

The following attachment contains two Visual Studio 2005 projects that can get you on your way towards a fast transcription process for non-developers (no Visual Studio needed for them).


MSSTranscriptionService.zip (3.75 MB)
4/9/2008 7:49:04 PM (GMT Daylight Time, UTC+01:00)
Thursday, April 03, 2008

Unit testing code is important.  If you make code changes in a library that other people are using, you want to make sure all of the code works as expected.  Using NUnit is great for that.

However, when it comes to speech applications, you probably manually test your applications before each release.

If you are writing managed code for Office Communications Server 2007 Speech Server, and you are using SIP for your telephony lines, I have something that will help you automate your testing.  I created a simple class that will send SIP INFO requests to the caller if they include "log=true" in the SIP URI parameters.  It basically works like this:

  1. Call into your application and then generate a test script based on the call log.
  2. Run the customized OutboundCalls application, passing your newly generated script as the script to run.
  3. The OutboundCalls application will automatically go through the application, following the same path.

Attached is the unit testing code, a demo and some basic instructions.  Open the UnitTesting solution and read the ReadMe.htm file for all the details.

Happy Testing!

UnitTesting.zip (8 MB)
4/3/2008 6:13:06 PM (GMT Daylight Time, UTC+01:00)
Tuesday, June 05, 2007

In the health industry, I've frequently run into the following predicament:  User IDs are no longer simple numeric fields.  Traditionally, an employee's Social Security number may have been used as an ID, but HIPAA has put an end to that.

For speech recognition, this could present a common tuning and maintenance issue.  The following XSLT document is designed to automate this maintenance.

First, let's take a look at some simple user IDs stored in my sample database.

SELECT UserIDs FROM SampleIDs

UserIDs                                                     
------
119821
319871
31987M
D19821
D1982M
D19871
...

You can see there are a few alphanumeric patterns being used with these IDs.  Using a simple replacement, we can get all the non-numeric patterns in the database:

SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(UserIDs, '1', '_'), '2', '_'), '3', '_'), '4', '_'), '5', '_'), '6', '_'), '7', '_'), '8', '_'), '9', '_'), '0', '_') AS Pattern,
COUNT(*) AS RecordCount
FROM SampleIDs
GROUP BY REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(UserIDs, '1', '_'), '2', '_'), '3', '_'), '4', '_'), '5', '_'), '6', '_'), '7', '_'), '8', '_'), '9', '_'), '0', '_')

Pattern RecordCount
-------------------
D_____  71680
_____M  57344
X____M  14336
D____M  71680
X_____  14336
______  57344

With this, we'll simply assume that every user ID is equally likely to be provided on a call, so each pattern's probability is its share of the records.  So, D_____ has a probability of 71680/286720, or a 25% chance of being provided.  Using this information, we can weight the probability of this pattern matching an utterance for a user ID.
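To make the arithmetic concrete, here is a minimal Python sketch (not part of the original download) of the same two steps: masking digits to get a pattern, and turning record counts into weights.  The function names id_pattern and pattern_weights are made up for illustration; the counts are the ones returned by the query above.

```python
from collections import Counter

DIGITS = "0123456789"

def id_pattern(user_id: str) -> str:
    """Replace every digit with '_' so only the letter positions remain."""
    return "".join("_" if ch in DIGITS else ch for ch in user_id)

def pattern_weights(counts: dict[str, int]) -> dict[str, float]:
    """Turn per-pattern record counts into weights that sum to 1."""
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.items()}

# Patterns for the six sample IDs listed earlier in the post.
sample_ids = ["119821", "319871", "31987M", "D19821", "D1982M", "D19871"]
sample_patterns = Counter(id_pattern(uid) for uid in sample_ids)

# Record counts copied from the query output above.
record_counts = {
    "D_____": 71680, "_____M": 57344, "X____M": 14336,
    "D____M": 71680, "X_____": 14336, "______": 57344,
}
weights = pattern_weights(record_counts)
```

Doing this in SQL, as above, keeps the work in the database; the Python version is only meant to show what the nested REPLACE calls and the division are doing.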

First, I run the above SQL statement and put the results into a dataset.  You can simply create the XML from the recordset, too.  My XML results look like this:

<records>
<record>
<Pattern>D_____</Pattern>
<RecordCount>71680</RecordCount>
</record>
<record>
<Pattern>_____M</Pattern>
<RecordCount>57344</RecordCount>
</record>
<record>
<Pattern>X____M</Pattern>
<RecordCount>14336</RecordCount>
</record>
<record>
<Pattern>D____M</Pattern>
<RecordCount>71680</RecordCount>
</record>
<record>
<Pattern>X_____</Pattern>
<RecordCount>14336</RecordCount>
</record>
<record>
<Pattern>______</Pattern>
<RecordCount>57344</RecordCount>
</record>
</records>

Now comes the fun part - using XSLT to create a GRXML grammar file.

A couple of points about the XSL file:

  1. For the JavaScript function buildCharacterArray(s1), you could probably slim the function down to "return s1.split('');"  It splits a string into a character array.
  2. XSLT 1.0 doesn't have a counting loop (it has xsl:for-each over nodes, but nothing like "for i = 1 to n"), so I recursively call the buildItem template.
  3. In this example, the TAG is set as an attribute of the related ITEM element.  For other platforms, this may need to be an element trailing the ITEM element.  Of course, the tag syntax is different on each platform:(
  4. In the real world you may find, through transcriptions and tuning, that people may utter dashes and spaces, too, or say, "B as in boy."  They may also truncate leading 0's (00001214V may be spoken 1214V).
  5. It doesn't handle robust recognition for utterances like "nine double oh one seven" or "nine thirty nine twenty two."  Unless you find patterns through transcriptions and tuning, effectively accommodating this will drop your accuracy and performance through the floor.
  6. Use your web server to cache the GRXML output; there's no need to run it too often.

The resulting grammar looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<grammar xml:lang="en-US" version="1.0" root="main" mode="voice" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:nextivr="http://www.nextivr.com/XSLFunctions" xmlns:rs="urn:schemas-microsoft-com:rowset">
<rule id="main" scope="public">
<one-of>
<item weight="0.25">
<ruleref type="application/srgs+xml" uri="#D_____" />
</item>
<item weight="0.2">
<ruleref type="application/srgs+xml" uri="#_____M" />
</item>
<item weight="0.05">
<ruleref type="application/srgs+xml" uri="#X____M" />
</item>
<item weight="0.25">
<ruleref type="application/srgs+xml" uri="#D____M" />
</item>
<item weight="0.05">
<ruleref type="application/srgs+xml" uri="#X_____" />
</item>
<item weight="0.2">
<ruleref type="application/srgs+xml" uri="#______" />
</item>
</one-of>
</rule>
<rule scope="private" id="D_____">
<item tag="D">D</item>
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
</rule>
<rule scope="private" id="_____M">
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<item tag="M">M</item>
</rule>
<rule scope="private" id="X____M">
<item tag="X">X</item>
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<item tag="M">M</item>
</rule>
<rule scope="private" id="D____M">
<item tag="D">D</item>
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<item tag="M">M</item>
</rule>
<rule scope="private" id="X_____">
<item tag="X">X</item>
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
</rule>
<rule scope="private" id="______">
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
<ruleref type="application/srgs+xml" uri="#number" />
</rule>
<rule scope="private" id="number">
<one-of>
<item tag="1">one</item>
<item tag="2">two</item>
<item tag="3">three</item>
<item tag="4">four</item>
<item tag="5">five</item>
<item tag="6">six</item>
<item tag="7">seven</item>
<item tag="8">eight</item>
<item tag="9">nine</item>
<item tag="0">zero</item>
</one-of>
</rule>
</grammar>
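The shape of the grammar above can also be sketched outside XSLT.  The following Python is only an illustration of the structure, not the XSL from the download: build_rule and build_main_rule are hypothetical names, and the tag-as-attribute form simply copies the listing above.

```python
import xml.etree.ElementTree as ET

SRGS_TYPE = "application/srgs+xml"

def build_rule(pattern: str) -> ET.Element:
    """Build one private rule: letters become tagged items, '_' becomes a digit reference."""
    rule = ET.Element("rule", {"scope": "private", "id": pattern})
    for ch in pattern:
        if ch == "_":
            ET.SubElement(rule, "ruleref", {"type": SRGS_TYPE, "uri": "#number"})
        else:
            item = ET.SubElement(rule, "item", {"tag": ch})
            item.text = ch
    return rule

def build_main_rule(weights: dict[str, float]) -> ET.Element:
    """Build the public rule that picks one pattern rule, weighted by its share of records."""
    rule = ET.Element("rule", {"id": "main", "scope": "public"})
    one_of = ET.SubElement(rule, "one-of")
    for pattern, weight in weights.items():
        item = ET.SubElement(one_of, "item", {"weight": str(weight)})
        ET.SubElement(item, "ruleref", {"type": SRGS_TYPE, "uri": "#" + pattern})
    return rule

# Serialize one rule to text to compare with the listing above.
rule_xml = ET.tostring(build_rule("X____M"), encoding="unicode")
```

A plain loop over the pattern's characters does here what the recursive template does in the stylesheet.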

Download the code!

6/5/2007 4:02:22 PM (GMT Daylight Time, UTC+01:00)
Friday, May 11, 2007

I needed to add some simple, free FAX functionality to an application, so I invested a little time in doing it with Visual C# 2005 Express Edition.

Source from the web:

Resulting image:

I decided to make it use web browser content; since it's pretty easy to format a web page, no FAX imaging tools are required. Thanks to Michael McCloskey's article, Bitonal (TIFF) Image Converter for .NET, I was able to accomplish my goal rather easily!

I added in some random dithering to produce a decent balance between text pages and pages with images. I was tempted to implement Floyd-Steinberg Dithering, but my random results are good enough for my project.
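For readers who haven't met random dithering, here is a minimal Python sketch of the general idea.  It is not the code from Url2Fax; the function name, the list-of-rows image format and the seed parameter are assumptions made for the example.

```python
import random

def random_dither(gray_rows, seed=None):
    """Convert 8-bit grayscale rows (0 = black, 255 = white) to 1-bit rows.

    Each pixel is compared with its own random threshold, so a mid-gray
    area comes out as a speckle of black and white instead of a solid block.
    """
    rng = random.Random(seed)
    return [
        # randrange(255) yields 0..254, so pure white always stays white
        # and pure black always stays black.
        [1 if pixel > rng.randrange(255) else 0 for pixel in row]
        for row in gray_rows
    ]

page = [[0, 0, 0, 0], [128, 128, 128, 128], [255, 255, 255, 255]]
bitonal = random_dither(page, seed=1)
```

Black text on a white page passes through unchanged, while photos turn into noise whose density follows the brightness.  Floyd-Steinberg would instead push each pixel's rounding error onto its neighbors, which looks smoother but takes more code.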

I also added in code for generating the image from a web page, and creating the multi-page TIFF document.

I reference FaxComEx.dll, the Windows FAX server, in the code.  This makes it easy to generate a FAX from the command line.  For example, Url2Fax http://localhost/reportapp/report.jsp?repnum=1 18885551212 FAXSVR100 will send a report to 8885551212, using the fax printer on FAXSVR100 (don't use "localhost").  If you don't want to use the FAX service, leave off the parameters.  You can grab the image from your %TEMP% directory.

Here's the source code: Url2Fax.zip.  The exe is located in the release folder, in case you want to try it as is.  You'll need Windows XP or 2003 to use the built-in Fax delivery.  For Windows 2000, you can generate the image and then send it with the Windows 2000 fax server. (I have a script for that, too).


Enjoy!

5/11/2007 4:21:20 PM (GMT Daylight Time, UTC+01:00)
Thursday, May 03, 2007

One of the fundamental tasks in creating speech applications is building the grammars for automated speech recognition (ASR).  This entry features techniques to make your grammar-building code fast, efficient and maintainable.

In many situations, it is unrealistic to design and build your grammars in development and deploy them as static grammars in production.  For example, if you are writing an address verification application, you may want to ask the caller for the state, then the city, then the street and so on.  Instead of building many grammars up front (all the streets in each city, all the cities in each state), you may want to let user activity decide: street grammars get created and cached only for the most popular cities, city grammars only for the most popular states, and so on.

I tried three methods for performing this task.  In all the examples, I connected to a database to retrieve choices for the grammar.
In the first example, I wrote directly to a stream, writing the XML using strings.  In the next example, I used an XML dataset and an XSL stylesheet to transform the data to a grammar.  In the third example, I did the same as the second example, but I sent the results directly to a Response stream in ASP.NET.  Figures 1, 2 and 3 respectively provide samples of the code.

All three performed well.  Using a simple performance measurement of the total processor time used, they were all a fraction of a second, or just over a second at worst.  The top performer by far was using ASP.NET and the response stream.  Of course, IO is the performance killer for the first two; writing a file to disk is slow compared to writing to memory.

Total seconds of processor time, by code sample:

Grammar Items   Code 1   Code 2   Code 3
------------------------------------------
10              0.891    0.938    0.000!
100             0.906    0.984    0.012
1000            0.938    1.141    0.141


So of course, I suggest you use the code in Figure 3.  Here are some tips on why I think you should prefer it over the code in Figure 1.

  • If you need to customize the grammar, you can change the XSL file without recompiling the code.  Let's say you need to change the TAG element in the grammar (and for each VoiceXML platform, tags are implemented differently!).  You can adjust the XSL file and see visually how you're affecting the grammar.
  • By using the XML from the DataSet, you don't have to worry about data types as they're all converted to text.  If a database field changes in size or precision, the code still works without recompiling.
  • XSL is easier to read.  Mind you, to master it takes some work, but which code is easier for an IVR programmer to pick up...
    This:              

    while(TestDataReader.Read())
    {
        TestWriter.Write("<item>{0}</item><tag>colorid = {1};</tag>", TestDataReader.GetString(1), TestDataReader.GetInt32(0));
    }

    Or this?
    <xsl:for-each select="//record">
    <item><xsl:value-of select="description"/></item><tag>colorid = <xsl:value-of select="id"/>;</tag>
    </xsl:for-each>

  • Using Page Output Caching (http://msdn2.microsoft.com/en-us/library/ms972362.aspx) you can get great performance from the dynamic grammars.  Cache the files fresh every day, based on the URL parameters.  Schedule a task to call the common URLs, so the first caller of the day doesn't have to wait for the first compile (even though it's a fraction of a second).  You can also cache based on a database dependency - there are examples out there on how to do this.
  • Use web.config to store the SQL queries and XSL file names.  That'll make this grammar builder really flexible.

 In web.config:
 <configuration>
  <appSettings>
   <add key="colors" value="SELECT id, description FROM colors" />
   <add key="colorsxsl" value="SimpleGrammarTransformer.xsl" />
  </appSettings>
 </configuration>

 In your code replace the SQL query with the following:
  System.Configuration.ConfigurationSettings.AppSettings[Request.QueryString.Get("grammar_id")]

 URL to get the colors grammar: 
  http://servername/BuildGrammar?grammar_id=colors

  • Use a SCRIPT block in the XSL to manipulate the data, instead of doing it in the compiled code.  Using script makes it easy to run JavaScript on the XML as it's being processed by the XSL stylesheet.  I've used JavaScript to parse comma-delimited strings into grammar items, clean up data, and more.  Perhaps if you need an example, I can post one...
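The web.config tip above boils down to a lookup from grammar_id to a query and a stylesheet.  Here is a minimal Python sketch of that lookup, just to show the idea; APP_SETTINGS is a hypothetical stand-in for the appSettings section, and resolve_grammar is a made-up name.

```python
# Hypothetical stand-in for the appSettings section of web.config.
APP_SETTINGS = {
    "colors": "SELECT id, description FROM colors",
    "colorsxsl": "SimpleGrammarTransformer.xsl",
}

def resolve_grammar(grammar_id: str) -> tuple[str, str]:
    """Return (sql_query, xsl_file) for a grammar_id, or raise KeyError.

    Only keys present in the settings can be requested, so the query text
    never comes from the URL itself.
    """
    query = APP_SETTINGS[grammar_id]
    stylesheet = APP_SETTINGS[grammar_id + "xsl"]
    return query, stylesheet

query, stylesheet = resolve_grammar("colors")
```

Adding a new grammar then means adding two settings, with no code change.  An unknown grammar_id fails the lookup instead of reaching the database.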

In conclusion, you should use ASP.NET and XSLT to create your dynamic grammars.  It's fast, flexible and easy.  Let me know what you think.  Should I include a download, or can you take it from here?

Have fun!

Figure 1 - Reading from a DB, writing strings to a stream

static void Main(string[] args)
{
    // Processor time used by this process so far; compared again at the end.
    TimeSpan TS1 = System.Diagnostics.Process.GetCurrentProcess().TotalProcessorTime;

    MySqlConnection DatabaseConnection = new MySqlConnection("Database=;Data Source=;User Id=;Password=");
    DatabaseConnection.Open();
    MySqlCommand TestCommand = new MySqlCommand("SELECT id, description FROM colors", DatabaseConnection);
    // CloseConnection closes the connection when the reader is closed.
    MySql.Data.MySqlClient.MySqlDataReader TestDataReader = TestCommand.ExecuteReader(System.Data.CommandBehavior.CloseConnection);
    System.IO.StreamWriter TestWriter = new System.IO.StreamWriter("c:\\temp\\Grammar2.grxml");

    // Grammar header, written as a plain string.
    TestWriter.Write("<?xml version=\"1.0\" encoding=\"utf-8\"?><grammar mode=\"voice\" version=\"1.0\" root=\"main\"><rule id=\"main\"><one-of>");

    if (TestDataReader.HasRows)
    {
        // One item/tag pair per row: column 1 is the description, column 0 the id.
        while (TestDataReader.Read())
        {
            TestWriter.Write("<item>{0}</item><tag>colorid = {1};</tag>", TestDataReader.GetString(1), TestDataReader.GetInt32(0));
        }
    }

    TestWriter.Write("</one-of></rule></grammar>");
    TestWriter.Close();
    TestDataReader.Close();

    TimeSpan TS2 = System.Diagnostics.Process.GetCurrentProcess().TotalProcessorTime;
    Console.WriteLine("Done. Total ticks = {0}.", TS2.Subtract(TS1).Ticks.ToString());
    Console.ReadLine();
}

Figure 2 - Reading from a DB to a DataSet, then transforming with XSLT.

static void Main(string[] args)
{
    // Processor time used by this process so far; compared again at the end.
    TimeSpan TS1 = System.Diagnostics.Process.GetCurrentProcess().TotalProcessorTime;

    // Fill a DataSet and wrap its XML in a <records> root element.
    MySqlConnection DatabaseConnection = new MySqlConnection("Database=;Data Source=;User Id=;Password=");
    MySqlDataAdapter DataAdapter = new MySqlDataAdapter("SELECT id, description FROM colors", DatabaseConnection);
    System.Data.DataSet DBDataSet = new System.Data.DataSet();
    DataAdapter.Fill(DBDataSet, "record");
    XmlDocument XMLTarget = new XmlDocument();
    XMLTarget.LoadXml("<records>" + DBDataSet.GetXml() + "</records>");
    string XmlTempFile = "c:\\temp\\temprecords.xml";
    XMLTarget.Save(XmlTempFile);

    // Transform the saved XML file into a grammar file on disk.
    string XslFile = "file://c:/temp/SimpleGrammarTransformer.xsl";
    System.Xml.Xsl.XslTransform StyleSheet = new System.Xml.Xsl.XslTransform();
    XmlUrlResolver URLResolver = new XmlUrlResolver();
    StyleSheet.Load(XslFile);
    StyleSheet.Transform(XmlTempFile, "c:\\temp\\Grammar1.grxml");

    TimeSpan TS2 = System.Diagnostics.Process.GetCurrentProcess().TotalProcessorTime;
    Console.WriteLine("Done. Total ticks = {0}.", TS2.Subtract(TS1).Ticks.ToString());
    Console.ReadLine();
}

Figure 3 - Reading from a DB and transforming the results to the response stream.

<%@ Page Language="c#" AutoEventWireup="false" Debug="true" %><%@ Import namespace="System.Xml"%><%@ Import namespace="MySql.Data.MySqlClient"%><%

TimeSpan TS1 = System.Diagnostics.Process.GetCurrentProcess().TotalProcessorTime;

MySqlConnection DatabaseConnection = new MySqlConnection("Database=;Data Source=;User Id=;Password=");
MySqlDataAdapter DataAdapter = new MySqlDataAdapter("SELECT id, description FROM colors", DatabaseConnection);
System.Data.DataSet DBDataSet = new System.Data.DataSet();
DataAdapter.Fill(DBDataSet, "record");
XmlDocument XMLTarget = new XmlDocument();
XMLTarget.LoadXml("<records>" + DBDataSet.GetXml() + "</records>");

string XslFile = String.Format("file://{0}", Server.MapPath("SimpleGrammarTransformer.xsl")).Replace("\\", "/");
System.Xml.Xsl.XslTransform StyleSheet = new System.Xml.Xsl.XslTransform();
XmlUrlResolver URLResolver = new XmlUrlResolver();
StyleSheet.Load(XslFile, URLResolver);

StyleSheet.Transform(XMLTarget, null, Response.OutputStream, null);

TimeSpan TS2 = System.Diagnostics.Process.GetCurrentProcess().TotalProcessorTime;

Response.Write(String.Format("<!--Done. Total ticks = {0} .-->", TS2.Subtract(TS1).Ticks.ToString()));

%>

Figure 4 - An XSLT file for building GRXML grammars

You can always change this so it outputs ABNF, or any GSL.


<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:msxsl="urn:schemas-microsoft-com:xslt"
  exclude-result-prefixes="msxsl"
>
<xsl:output method="xml"/>
  <xsl:template match="/">
    <grammar mode="voice" version="1.0" root="main" >
      <rule id="main">
        <one-of>
          <xsl:for-each select="//record">
            <item><xsl:value-of select="description"/></item><tag>colorid = <xsl:value-of select="id"/>;</tag>         
          </xsl:for-each>
        </one-of>
      </rule>
    </grammar>
  </xsl:template>
</xsl:stylesheet>


Figure 5 - GRXML result
<?xml version="1.0" encoding="utf-8"?>
<grammar mode="voice" version="1.0" root="main">
  <rule id="main">
  <one-of>
    <item>red</item><tag>colorid = 1;</tag>
    <item>orange</item><tag>colorid = 2;</tag>
    <item>yellow</item><tag>colorid = 3;</tag>
    .
    .
    .   
  </one-of>
  </rule>
</grammar>

 

5/3/2007 12:58:11 AM (GMT Daylight Time, UTC+01:00)
Wednesday, March 07, 2007

Hi All,

Seeing that some of the Google search traffic I receive is around NMS sound files, let me provide a little insight.

NMS sound files are recorded in their own proprietary format, optimized for quality and performance.

The NMS vox files are NOT the same as the Dialogic VOX files.

If you have some NMS VOX files that you need to convert, the easiest way is to convert them where some NMS software is installed (probably on the IVR machine itself).

Here's a Windows command line command to convert a folder of  NMS files to WAV files:

for %1 in (*.vox) do VCECOPY %1 %~n1.wav -c44M16

-c44M16 means the output encoding is 44 kHz, mono, 16-bit.

If the NMS file is indexed (use VCEINFO to figure it out), meaning it contains more than one recording - kind of like a ZIP file contains a bunch of files - you'll have to use a manual technique something like the following:

vcecopy messages.vox 0.wav -c44M16 -m0,0

vcecopy messages.vox 1.wav -c44M16 -m1,0

vcecopy messages.vox 2.wav -c44M16 -m2,0

Using Excel, you can write some formulas to build a list of commands.  You could also use some more advanced command line utilities - perhaps Windows PowerShell or grep, depending on the platform you are using.
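If you'd rather script it, here is a minimal Python sketch that prints the same kind of command list.  It only builds text that follows the pattern of the three commands above; the function name and the message count are assumptions, and you still run the output where the NMS tools are installed.

```python
def vcecopy_commands(vox_file: str, message_count: int, encoding: str = "44M16"):
    """Build one vcecopy command line per message index, following the
    pattern of the three hand-written commands above."""
    return [
        f"vcecopy {vox_file} {index}.wav -c{encoding} -m{index},0"
        for index in range(message_count)
    ]

# Print the commands so they can be pasted into a batch file.
for command in vcecopy_commands("messages.vox", 3):
    print(command)
```

Use VCEINFO to find out how many messages the file holds, then pass that number as message_count.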

If you need any help with decoding/encoding from one format to another, drop me a line.  NMS, Dialogic, Talx, raw PCM, GSM, whatever...

 

3/7/2007 3:02:13 PM (GMT Standard Time, UTC+00:00)
Friday, August 18, 2006

Seeing that it's been a while, I figure I can let you know what I'm up to.  I've been programming for clients; some VoiceXML, some web apps, some legacy IVRs.

In my spare time I've been building out an Asterisk adapter for our Arca applications.  It's pretty easy to install, etc.  Basically, our IVR applications can be written once and run on VoiceXML, MSS, proprietary IVR platforms, and "non-standard" IVRs like Asterisk.

If you're curious about Asterisk but know nothing about Linux, I can have you running an Asterisk switch on your workstation in two hours.

If you're curious about Arca, let me know.  We have videos of how it all works.

If you have any general IVR questions - any platform, any topic - drop me a line!

8/18/2006 9:20:29 PM (GMT Daylight Time, UTC+01:00)
Tuesday, May 16, 2006

I had someone ask, "Why is it that my cheap $20 sound card can play more sounds than a telephony card that costs thousands of dollars?"  He was asking about a quad-span T1 card he purchased for his IVR application.  Basically, a T1 is a single digital connection, cabled much like Ethernet, that carries 24 phone lines (its counterpart outside of the USA, the E1, carries 30).  In the world of IVRs, it's a lot easier to wire, and manage, a single digital connection instead of wiring 24 analog phone lines.

First, let me provide some background on telephony cards.  I'll address why they're so expensive, and then I'll address the WAV part of the question.

Basically, a telephony card offloads the telephony processing from the PC's CPUs, making it easy for apps to place a call, play a sound file, collect DTMF or request speech recognition, transfer a call and so on.  It's just like the video and audio processors in your PC, which handle all of the video and audio signaling.

Here are some of the processing tasks handled by telephony cards.

Multiplexing

The phone company takes the 24 phone lines assigned to you, digitizes their signals, and merges the results into one big data stream.  On the other end the telephony card takes the signal and starts breaking up that stream like a poker dealer.  Well, this dealer has 24 players to deal to.  Thousands of times a second (8,000 on a T1) he deals a little bit of signal to a phone line and then moves on to the next one.  Technically, this is Time-Division Multiplexing or TDM.
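The dealer analogy can be sketched in a few lines of Python.  This is a simplification, assuming one byte per channel per round and ignoring the framing and signaling bits a real T1 carries; the function names are made up for the example.

```python
CHANNELS = 24  # voice channels on a T1

def demultiplex(stream: bytes, channels: int = CHANNELS) -> list[bytes]:
    """Deal an interleaved byte stream out to its channels, round-robin.

    Byte 0 goes to channel 0, byte 1 to channel 1, and so on; byte number
    `channels` goes back to channel 0, like a dealer going around the table.
    """
    return [stream[ch::channels] for ch in range(channels)]

def multiplex(per_channel: list[bytes]) -> bytes:
    """Inverse of demultiplex: interleave equal-length channel buffers."""
    return bytes(b for frame in zip(*per_channel) for b in frame)

# Three rounds of dealing: in each round, channel n receives the byte value n.
stream = bytes(range(CHANNELS)) * 3
lines = demultiplex(stream)
```

On a real card this happens in hardware, continuously, in both directions, which is part of what you are paying for.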

Signaling

After multiplexing the signal, there is additional information placed in the digitized sound stream.  Since this is a digital signal, there is no way for an IVR system to perform a "flash hook" like we do at home to answer call waiting or to make a 3-way conference call.  The telephony card and telephone switch commonly talk to each other via signaling bits.  Hidden in the sound streams are bits of information that don't affect the audio, but provide a way for the ends to pass information.

Going off on a tangent: Without going into it too much, I just want to say that there are many standards used to perform signaling.  Don't overlook this!!!  Have a technical professional install a T1 telephony card for you.  Otherwise you'll probably spend days trying to get it to work.  Also, how your IVR handles calls and flash-hooks differs from one signaling standard to another.  Some configurations don't support flash-hook and blind transfers so you'll have to use feature codes (like dialing *29 instead) or you'll have to trombone or bridge transferred calls.

Another tangent: If you're thinking it sounds like you don't want to use a T1 and VOIP won't require all this overhead, think again!  VOIP has to handle all of this, too, but in different ways.  You can get telephony cards that handle all the VOIP processing for you, and they're probably easier to configure than a TDM T1.  If you expect to handle any kind of volume and you don't want to buy a telephony card of some sort, plan on dedicating a few high-grade CPUs for handling the telecommunications signaling and another CPU for your IVR applications. 

Advanced Features

Modern T1 cards have additional features to make life easy on the main CPUs.  Echo cancellation (echoes are annoying and can confuse speech recognition, DTMF, callers...), voice activity detection (is someone on the line?), answering machine detection (each T1 card vendor has their own algorithm), and conference bridging are some of these.

To address the cost of the card

All of the processors for handling the telephony are packed on the telephony card.  When your OS loads the card it can usually provide firmware files that among other things set up the card capabilities.  So you're spending thousands of dollars on a telephony card that will do a lot more than play sound files.

Now, on to the playing of Microsoft WAV files.

Have you ever copied a song to your hard drive as a WAV file and then converted it to an MP3?  Notice the size difference?  That's because WAV files are generally the capture of the raw digital signal received from the sound device (microphone, CD player, etc.).  Just like zipping a file compresses your desktop files, a CODEC (coder-decoder) finds patterns in the raw sounds and compresses and encodes them into a smaller form.  It can also take an encoded file and decode it so it can be heard through your speakers or over a telephone line.

So, why by default do the boards not provide a CODEC for Microsoft WAV files?  Well, here's my guess.  The telecommunications industry has traditionally been proprietary - not into standardizing - so each board vendor has their own "best compression, best quality" codec.  They also provide other common industry codecs to support existing methods of encoding sound files.  When you look at the total number of codecs, frequencies and bit-rates, you can't expect a phone channel to be ready to choose any of them in real time during a live call, so you usually load some of the codecs, not all of them.  All of the supported CODECs fit into a telephony card's requirements (memory, processing power, bus bandwidth) for playing/recording sound on 24-96 simultaneous phone calls.

Microsoft WAVs are traditionally found in home and office Windows-based applications.  They are for a single - or dual, stereo - sound channel, capable of higher quality than a telephone line.  There is no need for compression because the PC speakers are part of a whole computer system with excessive processing cycles and memory.

Two different industries, two different end devices, two different requirements.

Some boards may have codecs to support certain WAV formats, but if you're using uncompressed WAV files you're going to chew up a lot of disk space and use a lot of bandwidth between the telephony card and the PC's disk drive (or memory cache, God willing).
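To put rough numbers on that, here is a small Python calculation.  The two formats are my own illustrative picks, not figures from any particular board: 16-bit mono at 44.1 kHz for a typical uncompressed WAV, and 8-bit mono at 8 kHz for typical telephone-grade audio.

```python
def bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Data rate of fixed-rate sampled audio, in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels // 8

CALLS = 96  # upper end of the 24-96 simultaneous calls mentioned above

cd_quality_wav = bytes_per_second(44_100, 16)   # 16-bit mono at 44.1 kHz
telephone_audio = bytes_per_second(8_000, 8)    # 8-bit mono at 8 kHz

print(cd_quality_wav * CALLS)    # 8467200 bytes/s across 96 calls
print(telephone_audio * CALLS)   # 768000 bytes/s across 96 calls
```

Under those assumptions the uncompressed WAVs need about eleven times the disk space and bandwidth of the telephone-grade audio, for quality the phone line can't carry anyway.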

To be snippy:
Microsoft Media Player doesn't play Dialogic VOX files or Natural Microsystems VOX files.  Why not?  iPods don't play WAV files.  Why not?  Everyone uses MP3s for songs, not WAVs.  Why not?  Different requirements, different priorities, different codecs.  Use the right codec for your system.

Do you have any contentions?  Please post them!  Any IVR Sales Engineers out there who want to comment?  I wouldn't mind the clarification;)  If you're looking for audio codecs for different platforms, let me know.  I have access to some proprietary formats, too.

5/16/2006 2:27:54 PM (GMT Daylight Time, UTC+01:00)
