
SpeechTEK 2005
By John Latta, WAVE 0534 8/26/05

New York, NY
August 1 – 3, 2005

SpeechTEK is about speech recognition solutions. It is also about how speech technology can reduce costs or make devices more convenient to use. We also came away with a perspective that speech technology is but another biometric applied to useful problems.

Speech technology is another niche technology populated by a few companies and a small group of individuals with strong research backgrounds. SpeechTEK is an industry event focused on applications and products, and the number of applications is growing as the technology continuously improves. For example, speech is playing an increasing role in call centers, where speakers described ROIs measured in months. The technology is also present in many cell phones, and the convenience it brings is increasingly taken for granted. Speech technology is reaching mainstream usage.


Microsoft – A Long Term Investment in Speech Technology

Julian Odell, of Speech and Natural Language at Microsoft, provided an overview of the company's speech engines. Speech engine development is one component, along with natural language components, used in many Microsoft products. The original speech work dates to 1993, with the license of the CMU Sphinx-II speech recognizer. In 1999, Microsoft acquired Entropic of Cambridge, UK.

Products under development which will include speech technology are Windows Vista, WinFX, Speech Server, and Windows Mobile. Microsoft intends to achieve a 15% improvement in accuracy each year, and it has shown the ability to accomplish that, if not better.
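The presentation did not say how the 15% is measured. One plausible reading, and it is only an assumption here, is a 15% relative reduction in recognition error each year. Under that reading the gains compound, as the short sketch below shows. The 20% starting error rate is a made-up figure for illustration, not a Microsoft number.

```python
import math


def error_rate_after(initial_error, yearly_reduction, years):
    """Error rate after compounding a fixed relative reduction each year."""
    return initial_error * (1.0 - yearly_reduction) ** years


def years_to_halve(yearly_reduction):
    """Whole years needed before the error rate falls to half or less."""
    return math.ceil(math.log(0.5) / math.log(1.0 - yearly_reduction))


if __name__ == "__main__":
    # Hypothetical starting point: 20% error, 15% relative reduction per year.
    for year in range(6):
        print(year, round(error_rate_after(0.20, 0.15, year), 4))
    print("years to halve the error rate:", years_to_halve(0.15))
```

Under this assumption, a steady 15% per year halves the error rate in about five years.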


VoiceSignal – Driving Speech Applications to the Cell Phone

The CTO of VoiceSignal described their voice engine technology. They have a “small speech engine” which has been applied to cell phones. There are two drivers of what can be done in a cell phone: the bandwidth of the network and the CPU power in the phone itself. The DSP technology in the phone is not being used for speech but the CPU is.

One of the drivers of speech technology is customer acceptance. It was shown that 85% of the cell phone buyers are either somewhat, very or extremely interested in voice activation.

To build a speech engine for a cell phone, VoiceSignal began from scratch. The process is described as analyzing all the operations required to recognize speech. This included examining every calculation, approximating what is difficult to compute, throwing out everything not essential, and repeating the reexamination over and over again. This was done to create the lowest-footprint engine for a limited-capability cell phone.

So far, their engine has been implemented in 57 different phone models. Some of the new capabilities include a speaker-independent multimodal interface, that is, speaker-independent dialing and dialing from the phone book. It is also possible to launch applications with speech.


Motorola – Taking Phones to the Next Step with Voice

David Pearce, Motorola Labs, UK, described next generation voice technology in cellular phones. A multimodal interface combines several modalities, one of which is speech. David gave the illustration of user commands issued via speech or the keypad, with the system responding with either speech or sounds. The advantages claimed are:

In some uses, speech is better than a keypad;
Access is possible in hands busy/eyes busy situations;
User flexibility where the user chooses the mode of interaction.

The second technology David spoke of is Distributed Speech Recognition (DSR). This allows speech recognition to take place outside the phone, usually on a voice gateway. One advantage of the approach is that, with a DSR front end, the voice data sent for recognition is packetized for the IP network. DSR standards have already been developed by ETSI, 3GPP and IETF. Although David was optimistic that this would be implemented in the next 18 – 24 months, some questioned whether we will see the technology even some years ahead.
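To make the DSR split concrete, here is a minimal sketch of a front end that turns audio into per-frame features on the handset and packs them into packets for a recognizer elsewhere on the network. Everything specific in it is an assumption for illustration: the 80-sample frame, the log-energy feature (a stand-in for real recognition features) and the three-byte header are invented here and are not the ETSI, 3GPP or IETF formats.

```python
import math
import struct

FRAME_SIZE = 80  # hypothetical frame length: 10 ms at 8 kHz


def frame_log_energy(samples, frame_size=FRAME_SIZE):
    """Front end: one log-energy value per complete frame of audio."""
    features = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        features.append(math.log(energy + 1e-10))  # offset avoids log(0)
    return features


def packetize(features, frames_per_packet=2):
    """Pack features into packets: sequence number, frame count, payload."""
    packets = []
    for seq, start in enumerate(range(0, len(features), frames_per_packet)):
        chunk = features[start:start + frames_per_packet]
        header = struct.pack("!HB", seq, len(chunk))
        payload = struct.pack("!%df" % len(chunk), *chunk)
        packets.append(header + payload)
    return packets


def unpack(packet):
    """Gateway side: recover the sequence number and the feature values."""
    seq, count = struct.unpack("!HB", packet[:3])
    return seq, list(struct.unpack("!%df" % count, packet[3:]))


if __name__ == "__main__":
    audio = [1.0] * 160 + [0.0] * 80  # three frames of toy audio
    for packet in packetize(frame_log_energy(audio)):
        print(unpack(packet))
```

The point of the design is that only compact features cross the network, so the recognizer on the gateway can be large while the handset does a small, fixed amount of work.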

One of the advantages of this architecture is that it allows operators to make more money with hosted services. As David said, the significant carrier revenues are coming from:

Ring tones
SMS
Ring Back

And, the ability to gain another revenue stream with voice services is quite appealing.


Wyndham Hotels – Rapid Payback for Speech

David Mussa, VP of Reservations at Wyndham International, described how they implemented speech in their call centers. This is a hosted ASP service provided by Voxify. So far they have implemented the following via voice: Welcome, Hotel Information, Confirm and Cancel. In Q3 of this year, reservations will go online as a speech application, followed in Q4 by the hotel loyalty program. Some of the performance statistics were striking:

Deployment: 6 – 8 weeks
Change Requests: Days
ROI: < 2 months
Per call savings: 85%


1-800 Flowers – Another Payback Story

1-800 Flowers is a nationwide retail seller of flowers, plants, gourmet foods and more. They have an Internet, phone and retail presence. The call volume is 30,000 per month, with 200,000 on peak days. With speech they were able to automate 2/3 of those calls and, as a result, there was a 75% decline in call handling time.


Microphones and More Microphones

The WAVE wondered why microphones are not discussed in these sessions. Is the speech S/N ratio not important? We spoke with LumenVox. The conversation began with a question: why is noise not discussed in the presentations on voice recognition?

Keep in mind that nearly all speech recognition applications are centered around an individual speaking on a telephone. This is a well defined acoustic environment with the microphone near the mouth. Noise can be an issue but we handle it with noise reduction software. To date, the industry has been focused on this narrow speaking condition.

There are three types of noise which interfere with voice recognition.

White Noise – this is the easiest to remove.

Stationary Noise – this is noise that has consistent characteristics.

Non-Stationary noise – this is the most difficult to remove as its properties vary in time and spectrum.

The class of noise which causes the most problems is another individual speaking – a form of non-stationary noise.

All the recognition products have some form of noise cancellation that mitigates the impact of noise on speech.

The real challenge arises when voice recognition moves away from the telephony environment. This includes:

Larger distances between the speaker and the microphone;

Home environments with many other noise sources, including multiple individuals speaking and background television; and

Other environments including the office or industrial situations.

Products which fit into these environments are only now surfacing.

It is in these different environments where more than one microphone makes a significant difference. Keep in mind that one wants to isolate the speaker and, in so doing, eliminate or significantly reduce other sounds which are considered noise. More than one microphone allows for directionality, much as one’s two ears do. Thus, the isolation of a speaker is much easier.
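The directionality comes from the small difference in arrival time of the same sound at each microphone. As a rough sketch of the idea, and not any vendor's method, the code below estimates that delay by cross-correlation and then aligns and averages the two channels, a simple form of delay-and-sum. The signals are toy sample lists invented for illustration.

```python
def estimate_delay(mic_a, mic_b, max_lag):
    """Lag in samples by which mic_b trails mic_a (peak of cross-correlation)."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(mic_a)):
            j = i + lag
            if 0 <= j < len(mic_b):
                score += mic_a[i] * mic_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_and_sum(mic_a, mic_b, lag):
    """Shift mic_b by the estimated lag so it lines up with mic_a, then average."""
    combined = []
    for i in range(len(mic_a)):
        j = i + lag
        b = mic_b[j] if 0 <= j < len(mic_b) else 0.0
        combined.append(0.5 * (mic_a[i] + b))
    return combined


if __name__ == "__main__":
    # The same pulse reaches the second microphone three samples later.
    mic_a = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    mic_b = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
    lag = estimate_delay(mic_a, mic_b, max_lag=4)
    print("estimated lag:", lag)
    print(delay_and_sum(mic_a, mic_b, lag))
```

Sound from the chosen direction adds up after alignment, while sound arriving with a different delay does not line up and is partly averaged away.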

It could well be that only 2 microphones are adequate but not enough research has been done to know the tradeoffs between noise reduction and the number of microphones and the various use environments.

The microphone requirements for a non-telephony product could well be product-specific based on the use environment and the expectation for speech recognition performance.

When there are multiple speakers in the environment, it could well be that the speech recognizer could isolate the speakers based on their speech characteristics and thus discriminate the one that is required. This is one form of noise cancellation using the recognizer.

Keep in mind that today such multiple microphone designs are not required. Further, to mandate multiple microphones is to change the infrastructure and this does not happen quickly. Thus, we are likely to see multiple microphone designs only where they are required.


Voice Verification – Niche Gets Established

The WAVE had dismissed the performance of voice as a biometric until we heard some of the arguments of the voice verification suppliers here at SpeechTEK.

Speech is the only biometric that can be used at a distance and does not require new infrastructure, i.e., a fingerprint or iris reader at every location. As a result, it has established a position in password resets, electronic wire transfers and the verification of inbound callers to call centers. The WAVE spoke with four vendors and heard presentations on the role that speech verification is playing.


PERSAY

PERSAY is an Israel-based company focused on biometric speaker identification. It has three products:

VocalPassword

This is a text dependent biometric speaker verification system that verifies a speaker in real time.

FreeSpeech

This is a text-independent system: speaker verification is performed during natural conversation.

S.P.I.D.

A voice mining and speaker identification system for law enforcement and intelligence agencies.

S.P.I.D. does one-to-many matching.

PERSAY can dynamically set the system security threshold based on need and actual performance. The demonstration was impressive. Over a period of time, the manager of the speech verification application can log the performance of the system, including rejected calls, identified speakers and suspected impostors. The system then constructs false acceptance rate (FAR) and false rejection rate (FRR) curves. The administrator can adjust the threshold of acceptance and rejection based on what is acceptable from these curves.
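The relationship between the threshold and the two curves can be sketched in a few lines. FAR counts impostors wrongly accepted and FRR counts genuine speakers wrongly rejected. This is a generic illustration, not PERSAY's implementation; the scores are invented, and the rule of "lowest threshold that keeps FAR within a limit" is one assumed policy among several an administrator might choose.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when scores at or above the threshold are accepted."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


def sweep(genuine_scores, impostor_scores, thresholds):
    """Build the curves as a list of (threshold, FAR, FRR) points."""
    return [(t, *far_frr(genuine_scores, impostor_scores, t))
            for t in thresholds]


def pick_threshold(curve, max_far):
    """Lowest threshold whose FAR is within max_far.

    The curve must be sorted by ascending threshold. FAR only falls as the
    threshold rises, so the first qualifying point also has the lowest FRR.
    """
    for threshold, far, frr in curve:
        if far <= max_far:
            return threshold, far, frr
    return None


if __name__ == "__main__":
    # Hypothetical logged match scores on a 0-100 scale.
    genuine = [91, 85, 78, 66, 95, 72, 88, 59]
    impostor = [12, 35, 48, 61, 22, 70, 30, 41]
    curve = sweep(genuine, impostor, range(30, 100, 10))
    for point in curve:
        print(point)
    print("chosen:", pick_threshold(curve, max_far=0.125))
```

Raising the threshold trades false acceptances for false rejections, which is why the choice is left to the administrator rather than fixed by the system.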

The CEO of PERSAY gave a presentation in which he described a recent implementation of their voice verification system in a large financial institution in New York City. The integration effort was challenging because of the large number of stakeholders, each with individual requirements:

Security department
IT and system administration
Helpdesk
Telephony platform integrator
Project management

One wondered if this is a sign of the future as biometrics becomes more integrated into the enterprise.


Voice Trust AG

Voice Trust, which has been used for a number of years in Europe, is now entering the U.S. market. It claims that it is the only voice verification technology to have achieved a CC (Common Criteria) rating. Its rating is EAL2Medium.

The Common Criteria for Information Technology (IT) Security Evaluation, also known as the ISO 15408 standard, is the new standard for specifying and evaluating the security features of computer products and systems.

Common Criteria is the first international standard for IT security evaluation and validation/certification for the National Information Assurance Partnership (NIAP). NIAP is a joint program sponsored by the National Security Agency (NSA) and the National Institute of Standards and Technology (NIST).

http://niap.nist.gov/cc-scheme/index.html

The product offerings include:

VOICE TRUST Password Reset Plug-In

This asks the user for the ID and then one or more challenge/response phrases.

VOICE TRUST Two Factor Authentication

The user must first cite a PIN or unique code, and this is voice authenticated. Then the system calls back the same user for voice authentication.

In Europe, IBM is an integration partner.


Diaphonics

Diaphonics has been in voice verification for 3.5 years with most of its customers being in the financial sector. Applications include:

Password reset
Wire transfer individual verification
User authentication

Diaphonics provides an integrated hardware and software solution. The box fits in a rack and a T-1 line is connected to it. They have found that their customers want a turn key solution and this has driven their approach.

One of the advantages of speech is that it is the only biometric that can be used at a distance. It is not practical to give out fingerprint readers to all, and the performance of voice has become acceptable to financial institutions. It is their experience that financial institutions err on the side of caution. During the installation process, they adjust the threshold level but have not found it necessary to go back and tune the system.

In a presentation, the President and CEO, Andy Osburn, stated that there are poor voice verification applications. These happen when:

There is no clear business case;
The performance requirements are unrealistic;
There is no practical way to enroll users; and
It is the wrong biometric for the situation.

Thus, they want to see demonstrable ROI, an addressable security gap and supportive internal and external users. Password resets were cited as a specific example of a good fit.


NICE

NICE provides similar technology for financial institutions. They allow the customers to collect voice prints to be used for detecting future fraudulent intent. This is a one-to-many matching application. However, the scale remains relatively small. One of the issues remains how to scale the matching technology when there is a large database of voice prints.
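The scaling concern is easier to see next to the one-to-one case. In the sketch below, verification compares a sample against a single enrolled print, while identification has to score the sample against every print in the database. Treating a voice print as a short numeric vector compared by cosine similarity is an assumption made for illustration; the vendors did not describe their representations, and the speaker names and numbers are invented.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two feature vectors, in the range -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(sample, enrolled_print, threshold):
    """One-to-one: does the sample match the claimed identity?"""
    return cosine_similarity(sample, enrolled_print) >= threshold


def identify(sample, database, threshold):
    """One-to-many: best match over all prints, or None if none is close.

    This is a linear scan, so the work grows with the number of prints.
    """
    best_id, best_score = None, threshold
    for speaker_id, voice_print in database.items():
        score = cosine_similarity(sample, voice_print)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id


if __name__ == "__main__":
    # Hypothetical three-dimensional voice prints.
    database = {
        "speaker_a": [1.0, 0.0, 0.2],
        "speaker_b": [0.1, 1.0, 0.3],
        "speaker_c": [0.5, 0.5, 1.0],
    }
    print(identify([0.9, 0.1, 0.25], database, threshold=0.8))
    print(identify([1.0, 1.0, 0.0], database, threshold=0.8))
```

With a handful of prints the scan is trivial; with a large database every incoming call pays for a comparison against every stored print, which is the scaling issue raised above.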


Point to Ponder – Unified Theory of Biometrics

As the WAVE sat in on the presentations about speech recognition technology, we wondered – Is this not just another biometric?

Ponder the following definition of biometrics:

Biometrics is the interface of one or more human characteristics to technology.

Speech is just a biometric. Individual speech recognition is the linkage of that biometric to an individual. As we have stated before, biometrics does not have to be linked to security to be useful, and some of the most interesting applications are not security related. This was very much the case in today’s presentations. Speech recognition is a form of pattern matching based on language and speech characteristics.

This perspective links to what was heard at AVBPA. Some of the most interesting applications of biometric technology are in cell phones. At AVBPA, we saw facial modeling used in phones while here at SpeechTEK, speech enhances the user interface. Thus, speech brings a strong contribution to convenience. We have also seen this on the security side of biometrics where individual identification can add convenience.

Biometrics, when seen from multiple perspectives, is actually a unifying technology for the human interface to technology.


Microsoft Cites the Missing Factor in Speech – Where is the Value Proposition?

Steve Change, Program Manager, Microsoft Speech Server, gave a presentation on Multimodality in Consumer Electronics. He stated that missing from the discussion on speech is the consumer value proposition: what are the costs and benefits? Some are claiming that speech has crossed the chasm, but Steve challenges this. The applications today are narrow and have yet to reach mass market acceptance. Until the value proposition is seriously examined, outside of the obvious enterprise ROI arguments, the technology will remain in niche silos.


WAVE Comments

At SpeechTEK, there was striking uniformity in how speech is used in individual verification:

Financial institutions: two primary applications, password resets and electronic funds transfer.

Remote identification: speech is used for remote authentication. However, there are limited capabilities for speech identification.

At the WAVE, we see a parallel with how fingerprints are used in large scale programs.

In US-VISIT, there is secondary screening which provides a second tier human assessment of the fingerprint when the confidence level of the match is low.

In voice verification, if the match score is below an acceptable threshold, secondary actions are taken, including a call back, vectoring to a live operator or a challenge/response with the speaker.

Thus, speech technology, in spite of being remote, is not just stand alone but a part of a multitier system. However, if a fingerprint is used as the entry device on the desktop, most solutions do not have such a multitier approach.
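The multitier pattern reduces to a small routing decision, sketched below. The function and action names are hypothetical labels for the secondary actions mentioned above, and the scores and threshold are invented; no vendor's actual call flow is implied.

```python
SECONDARY_ACTIONS = ("call_back", "live_operator", "challenge_response")


def route_caller(score, accept_threshold, secondary_action="live_operator"):
    """Accept a confident match; otherwise hand off to a second tier."""
    if secondary_action not in SECONDARY_ACTIONS:
        raise ValueError("unknown secondary action: " + secondary_action)
    if score >= accept_threshold:
        return "accept"
    return secondary_action


if __name__ == "__main__":
    print(route_caller(0.92, accept_threshold=0.80))
    print(route_caller(0.55, accept_threshold=0.80,
                       secondary_action="call_back"))
```

The first tier never rejects a caller outright. A low score only changes who, or what, makes the final decision, which is the same role secondary screening plays for fingerprints.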

We found it interesting that password reset is one of the most common applications for speaker verification. Functionally, as long as one can get a password, this is the same as domain log on.

Not once were the words “identity management” used at SpeechTEK, but speech is a part of this. Users had their identity managed when they accessed their accounts or changed passwords. Just as we saw a broad view of identity management at Digital ID World, speech is establishing a role in the enterprise which is consistent with and a component of identity management.

 

Page updated 1/24/07
Copyright 4th Wave Inc, 2007