Android: Ducking Audio with TextToSpeech
The Android app I'm working on makes use of TextToSpeech. It's also very likely that the user is playing music in the background. In order for the music to not drown out the TTS output, we'll make use of audio ducking, a feature that's "new" as of API level 8 (Android 2.2).
The ingredients needed for this are an instance of the TextToSpeech class and the AudioManager.
In order to get spoken text, we use TextToSpeech.speak(String text, int mode, HashMap<String,String> params);
Now, in order to properly get any background music to duck when we start speaking, and raise the volume when we're done speaking, we'll need to know when the TTS is done blabbering. And for this, we can register a callback via TextToSpeech.setOnUtteranceCompletedListener(). There are, however, a few pitfalls that we'll need to watch out for:
- TextToSpeech.setOnUtteranceCompletedListener() MUST be invoked after onInit() was called. Otherwise it won't necessarily register properly, and onUtteranceCompleted() may never be called.
- When invoking TextToSpeech.speak(), we must provide an "Utterance ID" via the params HashMap using TextToSpeech.Engine.KEY_PARAM_UTTERANCE_ID. If there's no ID, onUtteranceCompleted() will not be invoked.
- Because it is possible to queue up more utterances via TextToSpeech.speak() before the previous one has completed, we'll need to track how many we've queued up, so we know to only release audio focus after all of them have been completed. Otherwise we've got a race condition where we queued up two things to speak, but lose focus after the first one has completed. The result is a mess where music will cut in and out while text is being spoken.
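To make that last pitfall concrete, here's a minimal sketch of the bookkeeping in plain Java (8 or later), with no Android classes involved. Instead of a bare counter it hands out a unique ID per utterance and keeps the pending IDs in a set, which is a variation on the idea and not necessarily how you'd have to do it. UtteranceTracker, FocusHandler, and the "utterance-N" ID format are names I made up for this illustration; in a real app, acquire() and release() would wrap the AudioManager focus calls, and the returned ID would go into the params map under KEY_PARAM_UTTERANCE_ID.

```java
import java.util.HashSet;
import java.util.Set;

/** Tracks pending utterances so focus is released only after the last one. */
class UtteranceTracker {
    /** Hypothetical hook standing in for the audio focus calls. */
    interface FocusHandler {
        void acquire(); // would request audio focus with ducking
        void release(); // would abandon audio focus
    }

    private final Set<String> pending = new HashSet<>();
    private final FocusHandler focus;
    private long nextId = 0;

    UtteranceTracker(FocusHandler focus) {
        this.focus = focus;
    }

    /** Call right before speak(); returns the ID to pass along with the utterance. */
    synchronized String begin() {
        String id = "utterance-" + (nextId++);
        if (pending.isEmpty()) {
            focus.acquire(); // first utterance of a burst: duck the music
        }
        pending.add(id);
        return id;
    }

    /** Call from the completion callback; unknown or repeated IDs are ignored. */
    synchronized void end(String id) {
        if (pending.remove(id) && pending.isEmpty()) {
            focus.release(); // last one finished: let the music come back
        }
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        UtteranceTracker tracker = new UtteranceTracker(new FocusHandler() {
            public void acquire() { System.out.println("duck"); }
            public void release() { System.out.println("unduck"); }
        });
        String first = tracker.begin();
        String second = tracker.begin();
        tracker.end(first);  // still one pending, nothing printed
        tracker.end(second); // set is empty now, focus is released
    }
}
```

The methods are synchronized because the completion callback may well arrive on a different thread than the one calling speak(); that's an assumption on my part, but the locking is cheap either way.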
With this in mind, we'll create a wrapper class which will encapsulate this functionality.
public class DuckingTTS {
// Log tag
private final String TAG = DuckingTTS.class.getName();
// Debugging is on (set to false to cut down on spam)
private static final boolean D = false;
/**
* Which audio stream to use. We use the music one here,
* so that it'll be "in line" with the music volume from whatever is playing.
*/
private final int STREAM_TYPE = AudioManager.STREAM_MUSIC;
/**
* Parameters we're feeding with each invocation of speak()
*/
private HashMap<String, String> ttsParams;
/**
* How many utterances are playing at a particular moment.
*/
private int mUtterancesPlaying = 0;
/**
* The text-to-speech engine
*/
private TextToSpeech mTts;
/**
* Whether TTS is initialized (onInit() called yet?)
*/
private boolean mIsInitialized = false;
/**
* Queue up chatter when TTS is not initialized yet
*/
private List<String> queue = new ArrayList<String>();
/**
* AudioManager injected via RoboGuice
*/
@Inject
private AudioManager mAm;
/**
* Context will be injected via RoboGuice
* @param context
*/
@Inject
public DuckingTTS(Context context) {
ttsParams = new HashMap<String, String>();
ttsParams.put(TextToSpeech.Engine.KEY_PARAM_STREAM, String.valueOf(STREAM_TYPE));
ttsParams.put(TextToSpeech.Engine.KEY_PARAM_UTTERANCE_ID, "ID");
mTts = new TextToSpeech(context, ttsOnInitListener);
mIsInitialized = false;
}
public boolean speak(String text) {
if (!mIsInitialized) {
if (D) Log.d(TAG, "speak(\"" + text + "\") - queued");
queue.add(text);
return false;
}
// Tell it to speak
if (D) Log.d(TAG, "speak(\"" + text + "\") - sending to TTS");
mTts.speak(text, TextToSpeech.QUEUE_ADD, ttsParams);
if (mAm == null) return true;
// if this is the first utterance (e.g. we're not already talking) then request audio focus w/ducking.
if (mUtterancesPlaying < 1) {
int status = mAm.requestAudioFocus(audioFocus, STREAM_TYPE, AudioManager.AUDIOFOCUS_GAIN_TRANSIENT_MAY_DUCK);
if (status == AudioManager.AUDIOFOCUS_REQUEST_FAILED) {
Log.e(TAG, "speak() audio focus request failed.");
}
}
mUtterancesPlaying++;
return true;
}
public void shutdown() {
mTts.shutdown();
}
private TextToSpeech.OnInitListener ttsOnInitListener =
new TextToSpeech.OnInitListener() {
/**
* Callback for when the TextToSpeech engine was initialized.
* Result will tell us whether this was successful or not.
*
* @param status
*/
public void onInit(int status) {
if (status != TextToSpeech.SUCCESS) {
return; // we just abort on failure, it's never fully initialized
// this can be bad, by the way, because every speak() call will now add something to the queue.
}
mIsInitialized = true;
mTts.setOnUtteranceCompletedListener(ttsOnUtteranceCompletedListener);
// Also speak anything that was queued up so far.
for (String text : queue) {
speak(text);
}
// The backlog has been handed to the engine, so don't hold on to it.
queue.clear();
}
};
private TextToSpeech.OnUtteranceCompletedListener ttsOnUtteranceCompletedListener =
new TextToSpeech.OnUtteranceCompletedListener() {
/**
* Callback when TTS has completed an utterance.
*/
public void onUtteranceCompleted(String utteranceId) {
if (D) Log.d(TAG, "onUtteranceCompleted(\"" + utteranceId + "\")");
mUtterancesPlaying--;
if (mAm == null) return;
// once we're done speaking, lose audio focus.
if (mUtterancesPlaying < 1) {
mUtterancesPlaying = 0;
mAm.abandonAudioFocus(audioFocus);
}
}
};
private AudioManager.OnAudioFocusChangeListener audioFocus =
new AudioManager.OnAudioFocusChangeListener() {
public void onAudioFocusChange(int focusChange) {
// I don't think we actually care.
if (D) Log.d(TAG, "onAudioFocusChange(" + focusChange + ")");
}
};
}
That's it. I use it in a service, and I wire it in with a simple annotation using RoboGuice:
public class GpsService extends RoboService {
@Inject
private DuckingTTS mDuckingTTS;
// ... rest of the class implementation ...
}
As the final disclaimer: The class works for me and my purposes at the moment, but it doesn't handle every error scenario. Also, don't forget to call shutdown() to release the TTS resources.
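One of those unhandled scenarios is a failed onInit(), after which every speak() call keeps adding to the queue forever. Here's a rough sketch of one way to deal with that, in plain Java (8 or later) with no Android classes: cap the backlog, and drop text outright once initialization has failed. PendingSpeech, its State enum, and the capacity parameter are all hypothetical names for this illustration, and whether dropping the oldest entries is the right policy depends on your app.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Holds text submitted before the speech engine is ready. */
class PendingSpeech {
    enum State { WAITING, READY, FAILED }

    private final Deque<String> queue = new ArrayDeque<>();
    private final int capacity;
    private State state = State.WAITING;

    PendingSpeech(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Returns true if the caller should speak right away, false if queued or dropped. */
    synchronized boolean offer(String text) {
        if (state == State.READY) return true;
        if (state == State.FAILED) return false; // engine never came up: drop it
        if (queue.size() == capacity) {
            queue.removeFirst(); // keep only the newest entries
        }
        queue.addLast(text);
        return false;
    }

    /** Call from the init callback; returns the backlog to speak (empty on failure). */
    synchronized List<String> onInit(boolean success) {
        state = success ? State.READY : State.FAILED;
        List<String> backlog = success ? new ArrayList<>(queue) : new ArrayList<String>();
        queue.clear();
        return backlog;
    }

    public static void main(String[] args) {
        PendingSpeech pending = new PendingSpeech(2);
        pending.offer("one");
        pending.offer("two");
        pending.offer("three"); // pushes "one" out
        System.out.println(pending.onInit(true)); // prints [two, three]
    }
}
```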
Android: Getting started with RoboGuice 2.0 (beta 3)
When I started messing around with Android, it consisted mostly of copying and pasting example code together to quickly get some results. That works, but the unfortunate side effect is that the Activity or Service class balloons out with functionality and features that are better off encapsulated according to proper object oriented concepts and best practices and what not.
However, once I started spending more time modeling classes, I realized that there are an awful lot of cases where you'll need to pass around contexts in order to get access to service providers. AudioManager for TextToSpeech, LocationManager for GPS, SensorManager for accelerometer information, PowerManager for wake locks; just about anything worth doing requires accessing a service provider. So as I started encapsulating functionality in classes I wasn't sure how to best go about initializing them. Do I keep passing around the context via the constructor, and provide setters and getters to inject mock services for unit testing? Do I use factories?
Luckily, I stumbled across RoboGuice, which extends Google's Guice dependency injection framework. Although the current "production ready" version of RoboGuice is at 1.1 (and uses Guice 2.0), RoboGuice version 2.0 uses Guice 3.0, and was simpler to set up — because it doesn't need a custom Application class. I'm all about simplicity (everything should be made as simple as possible, but not simpler).
Quick note here, I'm writing this from the point of retrofitting it to an already existing application. So I already have my application set up. I just want to take advantage of RoboGuice now to simplify it a bit.
Why?
Alright, so a good first question is, what the heck is RoboGuice, and why do I want to use it?
Essentially, RoboGuice is a dependency injection framework and allows for inversion of control. That's almost saying the same thing in two different ways. If that doesn't tell you anything, you should read up on those concepts. It would be silly for me to explain it here, since there are far better resources for that out there. In a nutshell, it helps streamline how objects are wired together by convention and configuration, allows for better separation of concerns, and, as a first benefit that's easy to grasp, it reduces the amount of boilerplate code that needs to be written.
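For a rough idea of what "wired together" means, here's a hand-rolled sketch of constructor injection in plain Java, without RoboGuice or any other framework. This is the manual version of what a dependency injection framework automates, not RoboGuice's own API. PaceAnnouncer, SpeechOutput, and RecordingOutput are hypothetical names for this illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Depends on a SpeechOutput but doesn't know how to create or look one up. */
class PaceAnnouncer {
    private final SpeechOutput output;

    PaceAnnouncer(SpeechOutput output) { // the dependency is handed in from outside
        this.output = output;
    }

    void announce(int minutes, int seconds) {
        String padded = (seconds < 10 ? "0" : "") + seconds;
        output.say("Current pace: " + minutes + ":" + padded + " per kilometer");
    }

    public static void main(String[] args) {
        RecordingOutput fake = new RecordingOutput();
        new PaceAnnouncer(fake).announce(5, 7);
        System.out.println(fake.spoken.get(0)); // prints Current pace: 5:07 per kilometer
    }
}

/** Hypothetical service interface, standing in for something like a TTS wrapper. */
interface SpeechOutput {
    void say(String text);
}

/** Test double that records what was said instead of producing audio. */
class RecordingOutput implements SpeechOutput {
    final List<String> spoken = new ArrayList<>();

    public void say(String text) {
        spoken.add(text);
    }
}
```

Because PaceAnnouncer only sees the interface, a unit test can pass in the recording fake while the app passes in the real thing. A framework takes over the job of deciding which implementation goes where.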
Take a look at A Simple Example that RoboGuice provides.
So right off the bat, their example shows that it's dead simple to wire up view object and system service providers, simply using annotations.
How?
As I said, I'm not even bothering with RoboGuice 1.1. Upgrading to RoboGuice 2.0 is explained on the RoboGuice Wiki. If you're completely new to it, however, it can be a bit overwhelming. To start from scratch, you need the following:
- Download the latest RoboGuice 2 snapshot. Currently, that's version 2.0b3. You'll want to drop this into your project's "/libs/" directory, which is where most other JAR files go as well if you use any (e.g. the fragments support backport android-support-v4.jar, or the Google Maps maps.jar, etc.)
- Download Guice 3.0. You want the guice-3.0-no_aop.jar. Again, this goes into the "/libs/" directory of your project.
- Not immediately obvious is that you'll also want to grab the guice-3.0.zip, because you need the javax.inject.jar from it. Yes, it also goes into the "/libs/" directory.
- The JARs need to be added to your project, so in Eclipse, go to the Project menu, Properties, Java Build Path, Libraries tab, now "Add JARs", and add all three JARs (guice-3.0-no_aop.jar, javax.inject.jar, and roboguice-2.0b3.jar).
Okay, now your project has RoboGuice added to it, but nothing is using it yet.
Putting it to use
One of the first things you'll want to do is go into one of your application's existing Activities. If you're doing it the simple / old way your class probably just extends Activity. Just change it so it extends RoboActivity instead. If you're using fragments and your activity is a FragmentActivity class, just change it to be RoboFragmentActivity. If you're using any services, and have a class that extends Service, modify the class to extend RoboService instead.
Then go through your onCreate() methods, rip out the findViewById() calls, and replace them with @InjectView annotations in front of your property declarations. It's easy to just check A Simple Example for reference again.
Instead of a setContentView() call in onCreate(), you can use the @ContentView(R.layout.layoutname) annotation right before your class definition.
For example:
@ContentView(R.layout.record)
public class RecordActivity extends RoboFragmentActivity
{
@InjectView(R.id.txtDistance) TextView txtDistance;
@InjectView(R.id.txtTime) TextView txtTime;
@InjectView(R.id.txtPace) TextView txtPace;
@InjectView(R.id.btnStart) Button btnStart;
}
I hope that helps you get started quickly and painlessly.
Let's try this again
Well, 2011 turned out to be a year of few blog entries. I tend to fall into a pattern where I go "oh yeah, this time I'll totally blog consistently", but that excitement dies down quickly. I think it's because much of what I do is what I'd describe as dabbling. I'll pick a technology, play with it, and prototype something. And then there are typically three outcomes: I get bored with it, my prototype is good enough for me (but in my opinion not worth showing off), or I realize that I lack the resources to bring the project to completion, and so I move on. Some projects are forgotten, others might get revisited later.
At the moment I'm dabbling with an Android project, and for now both personal interest and technical feasibility allow me to push this forward to see where things go. I'm hoping that one component or another is something I can share here, because otherwise I should just give up the blog thing.
Finding Duplicate or Similar Images with Perceptual Hashing in PHP
A while ago on Reddit someone asked how TinEye works. It's pretty fascinating; you upload a photo (or point it at a URL of an image) and it'll find other locations with similar images, if they've been indexed. Even if those images are in different sizes, or have had minor changes made to them, be it due to compression or because someone added or removed some text. So in a way, it's a fuzzy image-based search engine.
Although I'm sure TinEye has its own set of algorithms and custom applications to drive all this, something similar, if crude in implementation, can be achieved with available software.
At the center of all this is an image hashing algorithm. Usually (cryptographic) hashes are designed to detect even the slightest modifications and return a completely different hash. We're looking for the opposite, and libphash delivers:
The phash library implements a "perceptual hashing" algorithm. From their site:
A perceptual hash is a fingerprint of a multimedia file derived from various features from its content. Unlike cryptographic hash functions which rely on the avalanche effect of small changes in input leading to drastic changes in the output, perceptual hashes are "close" to one another if the features are similar.
So the idea is that you feed it an image, and it'll return a hash, and the more similar two hashes are to one another, the more alike the images are. The comparison of similarity is done by calculating the Hamming distance, which in a way is the bit-level version of the levenshtein() function that PHP developers may already be familiar with.
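To show how little there is to the distance calculation, here's a small sketch in Java (8 or later). It assumes the hash is 64 bits wide and written as 16 hexadecimal digits; the hash values in main() are made up for the example and don't come from real images. HashDistance is just a name I picked.

```java
/** Hamming distance between two 64-bit hashes given as hexadecimal strings. */
class HashDistance {
    static int hamming(String hexA, String hexB) {
        long a = Long.parseUnsignedLong(hexA, 16);
        long b = Long.parseUnsignedLong(hexB, 16);
        return Long.bitCount(a ^ b); // XOR leaves a 1 wherever the two hashes differ
    }

    public static void main(String[] args) {
        // Made-up hashes: the first pair differs in a single bit, the second in all of them.
        System.out.println(hamming("ff00ff00ff00ff00", "ff00ff00ff00ff01")); // prints 1
        System.out.println(hamming("0000000000000000", "ffffffffffffffff")); // prints 64
    }
}
```

A distance of 0 means the hashes are identical, and small distances suggest near-duplicates. Where exactly to draw the line is something you'd have to tune against your own images.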
Even better, the phash library comes with PHP bindings, and provides a few functions to get you started. For instance, there's the ph_image_hash() function. Simply give it a filename of an image, and it'll return, well, a resource. Now this really puts a damper on the usefulness of that function, since resources are fairly opaque and hard if not impossible to work with.
Fear not, I've made a few changes so that ph_image_hash() returns a plain string with the hexadecimal representation of the hash, which can then be stored in a database, for instance. You can grab phash from my github repository.
Alright, you've got a way to get to those hashes, now how does one index them and look them up in a speedy way? Well, this is where it gets interesting, and unfortunately a little bit theoretical.
Ideally, you'll store these hashes (and other metadata) in a database, but not a SQL database. You really want a vantage point tree, or better, a multiple vantage point (MVP) tree. Essentially these are binary trees that build up based on the distance of hashes. The idea is that hashes that are similar are close together. So you traverse the tree in order to get close to your match, and then just "look around" in that area and you'll likely find similar results.
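Here's a bare-bones sketch of a single vantage point tree in Java (8 or later), assuming 64-bit hashes compared by Hamming distance. It is a generic textbook-style version for illustration, not the phash library's MVPTree code. The vantage point is picked naively (the last element), and everything lives in memory. VpTree and its method names are my own.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal vantage point tree over 64-bit hashes, using Hamming distance. */
class VpTree {
    private final long vantage;   // hash stored at this node
    private final int radius;     // median distance from the vantage point to the rest
    private final VpTree inside;  // hashes with distance <= radius, or null
    private final VpTree outside; // hashes with distance > radius, or null

    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** Builds a tree from the given hashes; returns null for an empty list. */
    static VpTree build(List<Long> hashes) {
        return hashes.isEmpty() ? null : new VpTree(new ArrayList<>(hashes));
    }

    private VpTree(List<Long> items) {
        final long vp = items.remove(items.size() - 1); // naive pick: the last element
        List<Long> near = new ArrayList<>();
        List<Long> far = new ArrayList<>();
        int median = 0;
        if (!items.isEmpty()) {
            items.sort((x, y) -> Integer.compare(distance(vp, x), distance(vp, y)));
            median = distance(vp, items.get(items.size() / 2));
            for (long h : items) {
                (distance(vp, h) <= median ? near : far).add(h);
            }
        }
        vantage = vp;
        radius = median;
        inside = build(near);
        outside = build(far);
    }

    /** Adds every stored hash within maxDistance of the query to results. */
    void search(long query, int maxDistance, List<Long> results) {
        int d = distance(query, vantage);
        if (d <= maxDistance) {
            results.add(vantage);
        }
        // Triangle inequality: only descend where the search ball can overlap the subtree.
        if (inside != null && d - maxDistance <= radius) {
            inside.search(query, maxDistance, results);
        }
        if (outside != null && d + maxDistance > radius) {
            outside.search(query, maxDistance, results);
        }
    }

    public static void main(String[] args) {
        List<Long> hashes = new ArrayList<>();
        hashes.add(0xff00ff00ff00ff00L);
        hashes.add(0xff00ff00ff00ff01L); // 1 bit away from the first
        hashes.add(0x00ff00ff00ff00ffL); // 64 bits away from the first
        List<Long> matches = new ArrayList<>();
        build(hashes).search(0xff00ff00ff00ff00L, 4, matches);
        System.out.println(matches.size()); // prints 2
    }
}
```

The "look around" part is the search() method: the smaller the allowed distance, the more subtrees get skipped. With a large threshold it degrades toward checking every hash, which is one reason this gets academic quickly.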
The MVP tree area gets pretty academic, and from what I've looked at so far, most of it is theoretical, presented as limited in applicability, but at the same time seems to be exactly what such a fuzzy image search engine would need to be based on. I'm fairly certain companies working on various aspects of image recognition and augmented reality and what not are all messing with this sort of thing, so there's likely very little incentive for them to publicize or advertise their algorithms.
The phash library does include a "MVPTree" library with basic examples of how data is stored. Having something like this built out into a scalable data store with an HTTP interface a la SOLR would be fantastic.
I'd immediately work on a PHP application to index my photos, detect duplicates, etc.