
SpeechTEK 2005
By John Latta, WAVE
0534 8/26/05
New York, NY
August 1 – 3, 2005
SpeechTEK is about speech recognition solutions. It is
also about how speech technology can reduce costs or make devices more
convenient to use. We also came away with a perspective that speech technology
is but another biometric applied to useful problems.
Speech technology is another niche technology populated
by a small group of individuals with strong research backgrounds and
a few companies. SpeechTEK, however, is an industry event focused on applications
and products, and the number of applications is growing as the technology
continuously improves. For example, speech is playing an increasing role
in call centers, where speakers described payback periods measured in months.
This technology is also present in many cell phones, and the convenience it
brings is increasingly taken for granted. Speech technology is reaching mainstream
usage.
Microsoft – A Long Term Investment in Speech
Technology
Julian Odell, of Speech and Natural Language at Microsoft,
provided an overview of speech engines at Microsoft. The speech engine,
along with natural language components, is used in many Microsoft
products. The original speech work dates to
1993 with the license to the CMU Sphinx-II speech recognizer. In 1999
Microsoft acquired Entropic of Cambridge, UK.
Products under development which will include speech technology
are: Windows Vista, WinFX, Speech Server, and Windows Mobile. Microsoft
intends to improve accuracy by 15% each year, and it has
shown the ability to achieve that, if not better.
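To put the 15% figure in perspective, here is a minimal sketch of how such a yearly gain compounds. It assumes that the 15% is a relative reduction in error rate each year, which the presentation summary does not confirm. The 10% starting error rate and the function name are hypothetical.

```python
def projected_error_rate(initial_error, yearly_reduction, years):
    """Error rate after compounding a relative yearly reduction.

    Assumption: "15% improvement in accuracy" is read as a 15% relative
    reduction in error rate per year. The source does not define it.
    """
    return initial_error * (1.0 - yearly_reduction) ** years


if __name__ == "__main__":
    # Hypothetical starting point: a 10% error rate today.
    for year in range(0, 6):
        rate = projected_error_rate(0.10, 0.15, year)
        print(f"year {year}: error rate {rate:.1%}")
```

Under these assumptions the error rate would halve in a little over four years.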
VoiceSignal – Driving Speech Applications
to the Cell Phone
The CTO of VoiceSignal described their voice engine technology.
They have a “small speech engine” which has been applied
to cell phones. There are two drivers of what can be done in a cell phone:
the bandwidth of the network and the CPU power in the phone itself. The
phone's DSP is not being used for speech recognition; the CPU
is.
One of the drivers of speech technology is customer acceptance.
It was shown that 85% of the cell phone buyers are either somewhat, very
or extremely interested in voice activation.
To build a speech engine for a cell phone, VoiceSignal
began from scratch. The process is described as analyzing all the operations
required to recognize speech: examine every calculation,
approximate what is difficult to compute, throw out everything not essential
and repeat the reexamination over and over again. This was done to
create the smallest-footprint engine for a cell phone of limited capability.
So far, their engine has been implemented in 57 different
phone models. Some of the new capabilities include a speaker-independent
multimodal interface, that is, speaker-independent dialing and
dialing from the phone book. It is also possible to launch applications
with speech.
Motorola – Taking phones to the Next Step
with Voice
David Pearce, Motorola Labs, UK, described next generation
voice technology in cellular phones. A multimodal interface combines several
modalities, one of which is speech. David gave the illustration of
user commands via speech or the keypad. The system could respond with
either speech or sounds. The advantages claimed are:
In some uses, speech is better than a keypad;
Access is possible in hands busy/eyes busy situations;
User flexibility where the user chooses the mode of
interaction.
The second technology David spoke of is Distributed Speech
Recognition (DSR). This allows speech recognition to take place outside
of the phone, usually on a voice gateway. One advantage of the approach
is that, with a DSR front end, the transmission of the voice for recognition
is packetized for the IP network. There are DSR standards already developed
by ETSI, 3GPP and IETF. Although David was optimistic that this would
be implemented in the next 18 – 24 months, some questioned whether we
will see the technology before some years have passed.
One of the advantages of this architecture is that it allows
operators to make more money with hosted services. As David said, the
significant carrier revenues are coming from:
Ring tones
SMS
Ring Back
The ability to gain another revenue stream with voice
services is therefore quite appealing.
Wyndham Hotels – Rapid Payback for Speech
David Mussa, VP of Reservations at Wyndham International,
described how they implemented speech in their call centers. This is a
hosted ASP service provided by Voxify. So far they have implemented the
following via voice: Welcome, Hotel Information, Confirm and Cancel. In Q3 of this
year they will bring reservations online as a speech application, followed
in Q4 by the hotel loyalty program. Some of the performance statistics
were striking:
Deployment: 6 – 8 weeks
Change Requests: Days
ROI: < 2 months
Per call savings: 85%
1-800 Flowers – Another Payback Story
1-800 Flowers is a nationwide retail seller of flowers,
plants, gourmet foods and more. They have an Internet, phone and retail
presence. The call volume is 30,000 per month, with 200,000 on peak days.
With speech they were able to automate two-thirds of those calls, and as
a result there was a 75% decline in call handling time.
Microphones and More Microphones
The WAVE wondered why microphones are not discussed in
these sessions. Is the speech signal-to-noise (S/N) ratio not important?
We spoke with LumenVox. The conversation began with a question: why is
noise not discussed in the presentations on voice recognition?
Keep in mind that nearly all speech recognition applications
are centered around an individual speaking on a telephone. This is
a well defined acoustic environment with the microphone near the mouth.
Noise can be an issue but we handle it with noise reduction software.
To date, the industry has been focused on this narrow speaking condition.
There are three types of noise which interfere with
voice recognition.
White Noise – this is the easiest to remove.
Stationary Noise – this is noise that has consistent
characteristics.
Non-Stationary noise – this is the most difficult
to remove as its properties vary in time and spectrum.
The class of noise which causes the most problems is
another individual speaking – a form of non-stationary noise.
All the recognition products have some form of noise
cancellation that mitigates the impact of noise on speech.
The real challenge arises when voice recognition moves
away from the telephony environment. This includes:
Larger distances between the speaker and the microphone;
Home environments with many other noise sources, including
multiple individuals speaking and background television; and
Other environments including the office or industrial
situations.
Products which fit into these environments are only
now surfacing.
It is in these different environments where more than
one microphone makes a significant difference. Keep in mind that one
wants to isolate the speaker and in so doing eliminate or significantly
reduce other sounds which are considered noise. More than one microphone
allows for directionality like one’s ears. Thus, the isolation
of a speaker is much easier.
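The directionality argument can be illustrated with a minimal delay-and-sum sketch. This is an illustration under stated assumptions, not a description of any vendor's product. The sample rate, tone frequencies and four-sample delay are hypothetical, chosen so that the off-axis noise cancels exactly; a real two-microphone array attenuates noise only partially, by an amount that depends on frequency and geometry.

```python
import math

SAMPLE_RATE = 8000  # Hz; telephone-quality sampling (assumed)


def tone(freq_hz, n_samples, delay_samples=0):
    """Sine tone, optionally delayed by a whole number of samples."""
    return [math.sin(2 * math.pi * freq_hz * (n - delay_samples) / SAMPLE_RATE)
            for n in range(n_samples)]


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def delay_and_sum(mic1, mic2):
    """Average two microphone signals, steered at a source equidistant from both."""
    return [(a + b) / 2 for a, b in zip(mic1, mic2)]


N = 800
speaker = tone(500, N)                          # reaches both microphones together
noise_at_mic1 = tone(1000, N)                   # off-axis noise source
noise_at_mic2 = tone(1000, N, delay_samples=4)  # same noise, arriving 4 samples later

mic1 = [s + n for s, n in zip(speaker, noise_at_mic1)]
mic2 = [s + n for s, n in zip(speaker, noise_at_mic2)]
output = delay_and_sum(mic1, mic2)

# Whatever remains in the output besides the speaker is residual noise.
residual_noise = [o - s for o, s in zip(output, speaker)]

if __name__ == "__main__":
    print(f"noise power at one microphone: {power(noise_at_mic1):.3f}")
    print(f"noise power after summing:     {power(residual_noise):.3e}")
```

The speaker's signal arrives in phase at both microphones and survives the average, while the off-axis noise arrives out of phase and is reduced. This is the same cue that two ears provide.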
It could well be that only 2 microphones are adequate
but not enough research has been done to know the tradeoffs between
noise reduction and the number of microphones and the various use environments.
The microphone requirements for a non-telephony product
could well be product-specific based on the use environment and the
expectation for speech recognition performance.
When there are multiple speakers in the environment,
it could well be that the speech recognizer could isolate the speakers
based on their speech characteristics and thus discriminate the one
that is required. This is one form of noise cancellation using the
recognizer.
Keep in mind that today such multiple microphone designs
are not required. Further, to mandate multiple microphones is to change
the infrastructure and this does not happen quickly. Thus, we are likely
to see multiple microphone designs only where they are required.
Voice Verification – Niche Gets Established
The WAVE had dismissed the performance of voice as a biometric
until we heard some of the arguments of the voice verification suppliers
here at SpeechTEK.
Speech is the only biometric that can be used at a distance
and does not require new infrastructure, i.e., a fingerprint or iris
reader at every location. As a result, it has established a position
in password resets, electronic wire transfers and identity verification
of callers inbound to call centers. The WAVE spoke with four vendors
and heard presentations on the role that speech verification is playing.
PERSAY
PERSAY is an Israel-based company focused on biometric
speaker identification. It has three products:
VocalPassword
This is a text dependent biometric speaker verification
system that verifies a speaker in real time.
FreeSpeech
Speaker verification is performed during natural
conversation, independent of the text spoken.
S.P.I.D.
A voice mining and speaker identification system
for law enforcement and intelligence agencies.
S.P.I.D. does one-to-many matching.
PERSAY can dynamically set the system security threshold
based on need and actual performance. The demonstration was impressive.
Over a period of time, the manager of the speech verification application
can log the performance of the system including rejected calls, identified
speakers and suspected impostors. The system then constructs false
acceptance rate (FAR) and false rejection rate (FRR) curves. The
administrator can adjust the threshold of acceptance
and rejection based on what is acceptable from these curves.
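A minimal sketch of how FAR and FRR can be computed from logged match scores at different thresholds follows. It is not PERSAY's implementation. The scores, the 0-to-1 score scale, the accept-if-at-or-above rule and the function names are all assumptions made for illustration.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when scores at or above the threshold are accepted.

    FAR: fraction of impostor attempts wrongly accepted.
    FRR: fraction of genuine attempts wrongly rejected.
    """
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


def sweep(genuine_scores, impostor_scores, thresholds):
    """Points on the FAR/FRR curves as (threshold, FAR, FRR) tuples."""
    return [(t, *far_frr(genuine_scores, impostor_scores, t)) for t in thresholds]


# Hypothetical logged match scores in [0, 1]; not data from any real system.
genuine = [0.91, 0.85, 0.78, 0.66, 0.95, 0.72, 0.88, 0.59]
impostor = [0.12, 0.35, 0.48, 0.61, 0.22, 0.70, 0.30, 0.41]

if __name__ == "__main__":
    for t, far, frr in sweep(genuine, impostor, [0.4, 0.5, 0.6, 0.7, 0.8]):
        print(f"threshold {t:.1f}: FAR {far:.3f}  FRR {frr:.3f}")
```

Raising the threshold lowers FAR at the cost of a higher FRR. That is the tradeoff the administrator weighs when choosing an operating point.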
The CEO of PERSAY gave a presentation in which he described
a recent implementation of their voice verification system in a large
financial institution in New York City. The integration effort was
challenging because of the large number of stakeholders and their individual
requirements:
Security department
IT and system administration
Helpdesk
Telephony platform integrator
Project management
One wondered if this is a sign of the future as biometrics
becomes more integrated into the enterprise.
Voice Trust AG
Voice Trust, which has been used for a number of years
in Europe, is now entering the U.S. market. It claims that it is the
only voice verification technology to have achieved a CC (Common Criteria)
rating. Its rating is EAL2Medium.
The Common Criteria for Information Technology (IT)
Security Evaluation, also known as the ISO 15408 standard, is the
new standard for specifying and evaluating the security features
of computer products and systems.
Common Criteria is the first international standard
for IT security evaluation and validation/certification for the National
Information Assurance Partnership (NIAP). NIAP is a joint program
sponsored by the National Security Agency (NSA) and the National
Institute of Standards and Technology (NIST).
http://niap.nist.gov/cc-scheme/index.html
The product offerings include:
VOICE TRUST Password Reset Plug-In
This asks the user for the ID and then one or more
challenge/response phrases.
VOICE TRUST Two Factor Authentication
The user must first cite a PIN or unique code and
this is voice authenticated. Then the system calls back the same
user for voice authentication.
In Europe, IBM is an integration partner.
Diaphonics
Diaphonics has been in voice verification for 3.5 years
with most of its customers being in the financial sector. Applications
include:
Password reset
Wire transfer individual verification
User authentication
Diaphonics provides an integrated hardware and software
solution. The box fits in a rack and a T-1 line is connected to it.
They have found that their customers want a turnkey solution and this
has driven their approach.
One of the advantages of speech is that it is the only
biometric that can be used at a distance. It is not practical to give
out fingerprint readers to all and the performance of voice has become
acceptable to financial institutions. It is their experience that financial
institutions err on the side of caution. During the installation
process, they adjust the threshold level but have not found it necessary
to go back and tune the system.
In a presentation, the President and CEO, Andy Osburn,
stated that there are poor voice verification applications. These happen
when:
There is no clear business case;
Performance requirements are unrealistic;
There is no practical way to enroll users; and
It is the wrong biometric for the situation.
Thus, they want to see demonstrable ROI, an addressable
security gap and supportive internal and external users. Password
resets were cited as a specific example of a good fit.
NICE
NICE provides similar technology for financial institutions.
They allow the customers to collect voice prints to be used for detecting
future fraudulent intent. This is a one to many matching application.
However, the scale remains relatively small. One of the issues remains
how to scale the matching technology when there is a large database
of voice prints.
Point to Ponder – Unified Theory of Biometrics
As the WAVE sat in on the presentations about speech recognition
technology, we wondered – Is this not just another biometric?
Ponder the following definition of biometrics:
Biometrics is the interface of one or more human characteristics
to technology.
Speech is just a biometric. Individual speech recognition
is the linkage of that biometric to an individual. As we have stated
before, biometrics does not have to be linked to security to be useful,
and some of the most interesting applications are not security related.
This was very much the case in today’s presentations. Speech recognition
is a form of pattern matching based on language and speech characteristics.
This perspective links to what was heard at AVBPA. Some
of the most interesting applications of biometric technology are in cell
phones. At AVBPA, we saw facial modeling used in phones while here at
SpeechTEK, speech enhances the user interface. Thus, speech brings a
strong contribution to convenience. We have also seen this on the security
side of biometrics where individual identification can add convenience.
Biometrics, when seen from multiple perspectives, is actually
a unifying technology for the human interface to technology.
Microsoft Cites the Missing Factor in Speech – Where
is the Value Proposition?
Steve Change, Program Manager, Microsoft Speech Server,
gave a presentation on Multimodality in Consumer Electronics. He stated
that missing from the discussion on speech is the consumer value proposition:
what are the costs and benefits? Some are claiming that speech has crossed
the chasm but Steve challenges this. The applications today are narrow
and have yet to reach mass market acceptance. Until serious examination
is performed on the value proposition, outside of the obvious enterprise
ROI arguments, the technology will remain in niche silos.
WAVE Comments
At SpeechTEK, there was striking uniformity in how speech
is used in individual verification:
Financial institutions: two primary applications, password resets
and electronic funds transfer.
Remote identification: speech is used for remote authentication.
However, there are limited capabilities for speech identification.
At the WAVE, we see a parallel with how fingerprints are
used in large scale programs.
In US-VISIT, there is secondary screening which provides
a second tier human assessment of the fingerprint when the confidence
level of the match is low.
In voice verification, if the verification score is
below an acceptable threshold, secondary actions are taken, including
call back, vector to a live operator or challenge response to the speaker.
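The tiered handling described above can be sketched as a simple score-to-action mapping. The score bands and the ordering of the secondary actions are hypothetical; the presentations named the actions but did not say how vendors choose among them.

```python
# Hypothetical tiers: (minimum score, action), checked from highest to lowest.
# The score scale and the cut-offs are illustrative assumptions only.
TIERS = [
    (0.80, "accept"),
    (0.60, "challenge_response"),
    (0.40, "call_back"),
]
FALLBACK = "live_operator"


def next_action(score):
    """Return the next step for a voice verification match score."""
    for minimum, action in TIERS:
        if score >= minimum:
            return action
    return FALLBACK


if __name__ == "__main__":
    for score in (0.92, 0.71, 0.45, 0.10):
        print(f"score {score:.2f} -> {next_action(score)}")
```

Only the top band is accepted on voice alone. Every lower band falls through to a second tier, much as a low-confidence fingerprint match in US-VISIT is passed to a human examiner.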
Thus, speech technology, in spite of being remote, is not
just standalone but a part of a multitier system. However, if a fingerprint
is used as the entry device on the desktop, most solutions do not have
such a multitier approach.
We found it interesting that password reset is one of
the most common applications for speaker verification. Functionally,
as long as one can get a password, this is the same as domain logon.
Not once were the words “identity management” used
at SpeechTEK, but speech is a part of this. Users had their identity
managed when they accessed their accounts or changed passwords. Just
as we saw a broad view of identity management at Digital ID World, speech
is establishing a role in the enterprise which is consistent with and
a component of identity management.