Rabies, Laziness & Privilege

It was the night of the recent 5-state assembly elections. One of my company's clients is a major news channel, and I was at the studio late into the night, until the election commission announced that they had cancelled their press conference which was supposed to make an announcement about the final vote counts in Madhya Pradesh. A colleague offered to drop me home, and I got off at the gate of my colony, not wanting to subject my colleague to navigating the labyrinth that is any gated colony in South Delhi. I made my way home, humming to myself, thinking about what the elections portended about 2019 - as any dutiful Indian would. (Everyone in India is a political Pundit. Say 'politics' to the nearest Indian and you'll know what I mean.)

I had in mind the images of certain Indian politicians and how they'd act all nonchalant the following day. The inside of my head might have looked like a typical prime-time TV debate screenshot, with everyone yelling at each other and no one making sense. Too much TV or even YouTube does that to you. You can't imagine a civilized discussion between people, especially if they are clad in saffron or green. The whole collage almost resembled a Tricolor, with the Ashok Chakra representing the debate's all-encompassing sense of rubbish.

It's not healthy to dwell on such images. That is at least one reason why I am thankful for what happened next. The image was shattered when, right in the middle of a stride, I felt a sharp pain in the back of my right knee. I instantly yelled out a couple of obscenities. When I looked behind me, I saw a dog a few feet away, looking as nonchalant as a certain saffron-clad spokesperson I had seen a few hours earlier, behaving as if nothing out of the ordinary had happened that day.

The shock of the dog-bite perhaps made me a little numb and I stood there for a moment, just staring at the dog. It stared me down intently, stealthy little bastard. I was wondering if I'd invaded its territory or stepped on it's tail. I don't think I did. I normally stare a few feet ahead of me on the ground as I walk, especially at night in poorly lit streets. I was less than a hundred yards from home, so I just kept walking. I hoped that it was just a scratch. My wife, Aditi, was up waiting for me. When I looked at the wound I found that the dog had managed to sink three canines right through my jeans well into my skin. One of them was bleeding pretty bad. I wondered why I didn't feel it. Perhaps the shock didn't let me. Well, shit happens, I thought.

I was tired and in a little pain. All I wanted was to wash the wound, go to sleep, and worry about this problem tomorrow. That's what I do to my most pressing problems - first aid, sleep, deal with it tomorrow. Only Aditi wouldn't let me. She insisted that I keep washing and rewashing the wound for a good twenty minutes, and that I dab it with antiseptic. I didn't see what the big deal was. To me, it was just a bloody wound, like any other. Not to be too dramatic, but I now realize that Aditi probably saved my life. The possibility of getting rabies hadn't remotely occured to me. After all, who gets fucking rabies?

'Other people', that's who.

The next morning, I looked up rabies on Google. I did that while I was making coffee. After all I can't be expected to start the day without coffee, can I? Rabies can wait. And here was the first shocker. Apparently since 1985, 25,000 to 30,000 Indians die every year from rabies. Even malaria and dengue are rarer, clocking in at nearly 23,000 and 20,000 annual deaths respectively (source). Still, the extent of my laziness is such that even this did not shock me enough to do anything about it. Now I knew that a rabies diagnosis is practically a death sentence. There's no cure and no treatment. If the symptoms show up, you're dead in less than a week. But, after all, what were the chances that that particular dog was rabid? After a bit of Aditi's life-saving nagging, I got an appointment at a local hospital. The doctor I saw told me that since the attack was unprovoked, there's a good chance the dog might have been already mad. This was the first time in the entire episode that I felt something close to fear.

I took two shots - tetanus, and one of a series of rabies vaccines. (The fourteen injections to the stomach story isn't true, by the way. If it's any consolation, you need at best ten injections.) Apart from these two shots, you need something called Human Rabies Immunoglobin (HRIG), which has to be administered within 24 hours of the bite. The doctor warned me that these were expensive, but critical. Well, I thought, jaan hai toh jahaan hai. The pharmacist told me he'd try to find the HRIG and call me. I went home and waited for the call. An hour later the pharmacist told me they couldn't find it. I called every pharmacy in the neighbourhood to no avail. Not many people get bitten by strays in South Delhi, apparently. Then where do those 30,000 dead people come from?

'Other places', that's where.

Here, I panicked, and decided to call in the cavalry - my father, a physician himself, his medico friends and some of my own medico friends. I've enjoyed the privileges of being born to a doctor all my life. The only problem was that my father lives in Nagpur, I was in Delhi, and there were roughly ten hours left of the 24 hour deadline. Even if he did manage to locate the HRIG somehow, it was extremely unlikely that I'd manage to end up in Nagpur in less than ten hours. A friend of my Dad told me that pharmacies near big government hospitals usually have HRIG in stock. So I set off to AIIMS. (Not before waiting another hour and half to squeeze in a presentation draft which needed to be made yesterday. People needing things 'yesterday' is so common, I'm surprised no one has considered seriously working on time travel.)

Every single chemist near AIIMS took a full ten minutes to determine that they didn't have the HRIG. But, marta kya na karta, I made sure I had thoroughly combed the area. Now nearly six hours from the deadline, I started to realize that I was trembling a bit, and I wasn't thinking straight.

While I was standing near AIIMS, wondering where to take my next Uber, a friend of mine, Uday, sent me a link to an HRIG alternative. He told me that he'd spoken to a pharmacist in Connaught Place, who had confirmed he had that specific drug, all ten milliliters of it. In a few minutes, Dad sent me the same link. I went to Connaught place, bought the medicine and went back to the local hospital where I'd received my earlier shots.

By this time I knew that HRIG is supposed to be injected in the wound area. The remainder, if any, can be injected anywhere else. The wound area by now was so swollen that I couldn't bend my knee. I had been limping the whole day. The doctor warned me it would be painful. True enough, the three long, hard shots I received right in my wounds were far far more painful than the canines that had created those wounds. I was yelling and shreiking in the emergency room. Thankfully the pain subsided as quickly as it had come. The good doctor did his best to explain to me why this pain was necessary, and I kept nodding my head and smiling in embarrassment. After a final word about dressing the wound, he sent me home.

Later that night Dad asked for a picture of a wound - he wanted to get a second opinion. It turned out that the wound needed further debridement, and the next morning, he sent me to Maulana Azad Medical College. He insisted I go with someone, since the wound was deep and the procedure might leave me unable to walk properly for a while. My friend Prashant was kind enough to accompany me throughout. Bro, I really owe you one. Dad had a colleague there who would hook me up. As it turns out, the wound was fine and didn't need anything except regular cleaning. I came home with a small voice in my head lamenting that it was a wasted trip, but with a much louder voice saying, 'shut the fuck up, this was important!'

I'm fine now. There's not much pain or swelling. All that's left is for me to take the periodical vaccine shots spread over a month.

I kept reading all news reports I could find on dog bites. The WHO says that over 59,000 people die of rabies every year. That means almost every other rabies victim is an Indian. There may be some sampling bias, of course. But even in this little cranny of rabies and dog bite related deaths, things are clearly very bleak. Inadequate treatment is what kills dog bite patients. You take the vaccine but ignore the wound, an infection will kill you. You ignore the tetanus shots, the tetanus will kill you. You take care of the wound and take vaccines - turns out that by the time the vaccines start acting, you're already infected. The biggest killer is the scarcity of HRIG. There's an equine variant of HRIG which is harvested from infected horses. There's a human variant which is harvested from tissue cultures (infected humans don't last too long for harvestation, I guess). It's very expensive and not well stocked. Even I - with all my money, all the access to good hospitals and doctors, with all the privileges of being in a city like Delhi - had a harrowing time of it. I can't even begin to imagine what happens to less privileged patients.

Secondly, my laziness is the stuff of legend among my friends and family. Even a death scare couldn't completely cure me of it. I hate to think how much damage it has done over the years. It's not pretty, this realization.

If there's one thing you take away from reading this, let it be this:

Other people is you.

Other places is home.

Weighted Loss Functions for Instance Segmentation

This post is a follow up to my talk, Practical Image Classification & Object Detection at PyData Delhi 2018. You can watch the talk here,

and see the slides here:

I spoke at length about the different kinds of problems in computer vision and how they are interpreted in deep learning architectures. I spent a fair bit of time on instance and semantic segmentation (for an introduction to these problems, watch Justin Johnson's lecture from the Stanford CS231 course here). In short, semantic segmentation deals with designating each pixel of an image as belonging to a class, and instance segmentation deals with identifying which instance of a class each pixel belongs to. The following images show an example of semantic versus instance segmentation.

At the outset, a semantic segmentation output can be converted to an instance segmentation output by detecting boundaries and labeling each enclosing object individually. Semantic segmentation is essentially a classification problem that is applied at each pixel of and image, and can be evaluated with any suitable classification metric. A useful metric to evaluate how capable a model is of learning the boundaries that are required for instance segmentation is called mAP of IoU - mean average precision of the intersection over union. This metric is designed specifically to evaluate instance segmentation performance. Here's a brief explanation of how it works.

Imagine that we're solving a binary semantic segmentation problem, where the task is to simply predict if a pixel belongs to the background or foreground. We first create, as the ground truth, an image with two circular objects in it. Note that the circles, at the point where they are closest to each other, are separated by very few pixels.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from skimage import draw
%matplotlib inline

x = np.zeros((256, 256))
cy, cx = 80, 80
radius = 64
cy2 = cx2 = (np.sqrt(cy **2 + cx **2) + 2 * radius) * np.cos(np.pi / 4) + 1
o1r, o1c = draw.circle(cy, cx, radius)
o2r, o2c = draw.circle(cy2, cx2, radius)
x[o1r, o1c] = 1
x[o2r, o2c] = 1
plt.figure(figsize=(5, 5))
_ = plt.title('Ground Truth', fontsize=20)
  Click me to hide the output

Now imagine that we have a model that is good at semantic segmentation, but not so much at instance segmentation. So it (somewhat) wrongly predicts a labeled output that is almost identical to the ground truth, except it fills the gap that separates the two circles. After all, the circles are almost touching anyway.

In [2]:
from skimage.morphology import binary_dilation

prediction = binary_dilation(x)
plt.figure(figsize=(5, 5))
_ = plt.title('Prediction', fontsize=20)
  Click me to hide the output

Since this a binary classification problem, and we can tell that there is clear class imbalance, let's use the F1 score to evaluate our prediction.

In [3]:
from sklearn.metrics import f1_score

print(f1_score(x.ravel(), prediction.ravel()))
  Click me to hide the output

That's not bad at all. There are a very small number of pixels where the model is mistaken - the ones in the area where the circles are not supposed to be touching. Let's take a look at how well the other metric - mAP of IoU - does with this prediction. Here are a couple of functions that implement the mAP of IoU computation. Followed by a detailed explanation of the metric.

In [4]:
def iou(masks_true, masks_pred):
    Get the IOU between each predicted mask and each true mask.


    masks_true : array-like
        A 3D array of shape (n_true_masks, image_height, image_width)
    masks_pred : array-like
        A 3D array of shape (n_predicted_masks, image_height, image_width)

        A 2D array of shape (n_true_masks, n_predicted_masks), where
        the element at position (i, j) denotes the IoU between the `i`th true
        mask and the `j`th predicted mask.

    if masks_true.shape[1:] != masks_pred.shape[1:]:
        raise ValueError('Predicted masks have wrong shape!')
    n_true_masks, height, width = masks_true.shape
    n_pred_masks = masks_pred.shape[0]
    m_true = masks_true.copy().reshape(n_true_masks, height * width).T
    m_pred = masks_pred.copy().reshape(n_pred_masks, height * width)
    numerator = np.dot(m_pred, m_true)
    denominator = m_pred.sum(1).reshape(-1, 1) + m_true.sum(0).reshape(1, -1)
    return numerator / (denominator - numerator)

def evaluate_image(masks_true, masks_pred, thresholds):
    Get the average precision for the true and predicted masks of a single image,
    averaged over a set of thresholds

    masks_true : array-like
        A 3D array of shape (n_true_masks, image_height, image_width)
    masks_pred : array-like
        A 3D array of shape (n_predicted_masks, image_height, image_width)

        The mean average precision of intersection over union between
        all pairs of true and predicted region masks.

    int_o_un = iou(masks_true, masks_pred)
    benched = int_o_un > thresholds
    tp = benched.sum(-1).sum(-1)  # noqa
    fp = (benched.sum(2) == 0).sum(1)
    fn = (benched.sum(1) == 0).sum(1)
    return np.mean(tp / (tp + fp + fn))
  Click me to hide the output

Note that the IoU between any pair of binary masks can be any real number in $[0, 1]$. Therefore, it is necessary to apply a threshold the IoUs between all pairs of predicted and true masks, to get a meaningful evaluation. The convention used by many Kaggle competitions is to have a set of thresholds from 0.5 to 0.95 in steps of 0.05. In the following cell, we create this set of thresholds and use them to evaluate the metric.

In [5]:
# make the thresholds
THRESHOLDS = np.arange(0.5, 1, 0.05).reshape(10, 1, 1)

# segment the ground truth image into constituent masks
bg = np.zeros(x.shape)
true_mask_1 = bg.copy()
true_mask_1[o1r, o1c] = 1
true_mask_2 = bg.copy()
true_mask_2[o2r, o2c] = 1
y_true = np.array([true_mask_1, true_mask_2])

# reshape the prediction matrix to fit the format required by the `evaluate_image` function
y_pred = prediction.reshape((1,) + prediction.shape)

map_iou = evaluate_image(y_true, y_pred, THRESHOLDS)
  Click me to hide the output

That was by design. On careful inspection, it is apparent that both of the true masks account for less than half the area occupied by the predicted mask. Thus, the predicted mask has in IoU of less than 0.5 with each true mask. Since we start thresholding the IoU values at 0.5, the prediction did not register a true positive with either of the true masks - ultimately leading to a score of zero.

Let's try another hacked prediction, where we create two circular objects again, but this time they share the same centers as their ground truth counterparts, and have a radius that is less than the true radius by six units. Then, let's evaluate both the metrics again on these new predictions.

In [6]:
xnew = np.zeros((256, 256))
cy, cx = 80, 80
radius_new = 58
cy2 = cx2 = (np.sqrt(cy **2 + cx **2) + 2 * 64) * np.cos(np.pi / 4) + 1
o1r, o1c = draw.circle(cy, cx, radius_new)
o2r, o2c = draw.circle(cy2, cx2, radius_new)
xnew[o1r, o1c] = 1
xnew[o2r, o2c] = 1
plt.figure(figsize=(5, 5))
plt.imshow(xnew, cmap=plt.cm.gray)
_ = plt.title('Another Prediction', fontsize=20)
  Click me to hide the output
In [7]:
# segment the predicted image into constituent masks
bg = np.zeros(xnew.shape)
predicted_mask_1 = bg.copy()
predicted_mask_1[o1r, o1c] = 1
predicted_mask_2 = bg.copy()
predicted_mask_2[o2r, o2c] = 1
y_pred = np.array([predicted_mask_1, predicted_mask_2])

fscore = f1_score(y_true.sum(0).ravel(), y_pred.sum(0).ravel())
print("F1 Score: {}".format(fscore))

map_iou = evaluate_image(y_true, y_pred, THRESHOLDS)
print("MAP of IoU: {}".format(map_iou))
  Click me to hide the output
F1 Score: 0.9014126584439418
MAP of IoU: 0.7

The pixelwise accuracy (the F1 score) has taken a hit, since the prediction has lost two circular rings surrounding the true masks, but is still not bad. However, the IoU has shot up disproportionately! This shows that the MAP of IoU penalizes incorrect region separation a lot more than it rewards pixelwise correctness.

As a more general example, suppose we have $K$ true masks, and $L$ predicted masks. Then, in order to calculate the MAP of IoU, we construct a matrix $I \in \mathbb R^{K \times L}$ as follows:

$p_{1}$ $p_{2}$ $p_{3}$ $…$ $p_{l}$
$a_{1}$ $O_{11}$ $O_{12}$ $O_{13}$ $…$ $O_{1L}$
$a_{2}$ $O_{21}$ $O_{22}$ $O_{23}$ $…$ $O_{2L}$
$a_{3}$ $O_{31}$ $O_{32}$ $O_{33}$ $…$ $O_{3L}$
$a_K$ $O_{K1}$ $O_{K2}$ $O_{K3}$ $…$ $O_{Kl}$

where $O_{k, l}$ is the IoU of the $k$th true mask and the $l$th predicted mask. This matrix is whatt the iou function written above generates. Given our set of thresholds, $\Theta = \{0.5, 0.55, 0.6, ... 0.9, 0.95\}$, we can filter $I$ with all of them one by one. At each threshold $\theta_{i} \in \Theta$, we calculate a boolean matrix $I_{\theta_{i}} = I > \theta_{i}$. Using this matrix, we compute the following values:

  • $t^{+}(\theta_{i})$: The number of true positives - the number of predicted masks that found a match with a true mask. This is equal to the number of rows in $I_{\theta_{i}}$ that have at least one positive value.
  • $f^{+}(\theta_{i})$: The number of false positives - the number of predicted masks that found no match with a true mask. This is equal to the number of rows in $I_{\theta_{i}}$ that have no positive value.
  • $f^{-}(\theta_{i})$: The number of false negatives - the number of true masks that found no predicted match. This is equal to the number of columns in $I_{\theta_{i}}$ that have no positive value.

With these values computed for all thresholds $\theta_{i} \in \Theta$, we finally calculate the MAP of IoU as follows:

$$ \cfrac{1}{|\Theta|}\sum_{\theta_{i} \in \Theta}\cfrac{t^{+}(\theta_{i})}{t^{+}(\theta_{i}) + f^{+}(\theta_{i}) + f^{-}(\theta_{i})}$$
This measure is what the evaluate_image function written above calculates.

A popular neural network architecture to perform semantic / instance segmentation is the UNet:
It puts together the best properties of a network that are useful for pixel segmentation:

  1. It is fully convolutional
  2. It doesn't suffer because of the size of the image
  3. It incorporates learnable upsampling

A Keras implementation of a typical UNet is provided here. This model can be compiled and trained as usual, with a suitable optimizer and loss. For semantic segmentation, the obvious choice is the categorical crossentropy loss. For instance segmentation, however, as we have demonstrated, pixelwise accuracy is not enough, and the model must learn the separation between nearby objects. Normally such separation can be done with morphological operations on the images, but these operations cannot easily be made a part of the learning of the model. So the alternative is to force the network to learn region separations in an entirely data-driven manner. The UNet paper provides an interesting way of doing this - introducing pre-computed weight maps into the loss function which penalizes the loss near the boundaries of regions more than elsewhere. These weight maps are calculated as follows:

$$ w(\mathbf{x}) = w_{c}(\mathbf{x}) + w_{0} \times exp \Biggl( -\frac{(d_{1}(\mathbf{x}) + d_{2}(\mathbf{x}))^2}{2\sigma^2} \Biggr)$$
Here, $w_{c}$, $d_{1}$ and $d_{2}$ are all functions over a two dimensional image such that:

  • $w_{c}: \mathbb R^{m\times n} \rightarrow \mathbb R^{m\times n}$ is the class probability map.
  • $d_{1}: \mathbb R^{m\times n} \rightarrow \mathbb R^{m\times n}$ is the distance to the border of the nearest cell,
  • $d_{2}: \mathbb R^{m\times n} \rightarrow \mathbb R^{m\times n}$ is the distance to the border of the second nearest cell.

A vectorized implementation of $w$ is provided below in the make_weight_map function.

In [8]:
from skimage.segmentation import find_boundaries

w0 = 10
sigma = 5

def make_weight_map(masks):
    Generate the weight maps as specified in the UNet paper
    for a set of binary masks.
    masks: array-like
        A 3D array of shape (n_masks, image_height, image_width),
        where each slice of the matrix along the 0th axis represents one binary mask.

        A 2D array of shape (image_height, image_width)
    nrows, ncols = masks.shape[1:]
    masks = (masks > 0).astype(int)
    distMap = np.zeros((nrows * ncols, masks.shape[0]))
    X1, Y1 = np.meshgrid(np.arange(nrows), np.arange(ncols))
    X1, Y1 = np.c_[X1.ravel(), Y1.ravel()].T
    for i, mask in enumerate(masks):
        # find the boundary of each mask,
        # compute the distance of each pixel from this boundary
        bounds = find_boundaries(mask, mode='inner')
        X2, Y2 = np.nonzero(bounds)
        xSum = (X2.reshape(-1, 1) - X1.reshape(1, -1)) ** 2
        ySum = (Y2.reshape(-1, 1) - Y1.reshape(1, -1)) ** 2
        distMap[:, i] = np.sqrt(xSum + ySum).min(axis=0)
    ix = np.arange(distMap.shape[0])
    if distMap.shape[1] == 1:
        d1 = distMap.ravel()
        border_loss_map = w0 * np.exp((-1 * (d1) ** 2) / (2 * (sigma ** 2)))
        if distMap.shape[1] == 2:
            d1_ix, d2_ix = np.argpartition(distMap, 1, axis=1)[:, :2].T
            d1_ix, d2_ix = np.argpartition(distMap, 2, axis=1)[:, :2].T
        d1 = distMap[ix, d1_ix]
        d2 = distMap[ix, d2_ix]
        border_loss_map = w0 * np.exp((-1 * (d1 + d2) ** 2) / (2 * (sigma ** 2)))
    xBLoss = np.zeros((nrows, ncols))
    xBLoss[X1, Y1] = border_loss_map
    # class weight map
    loss = np.zeros((nrows, ncols))
    w_1 = 1 - masks.sum() / loss.size
    w_0 = 1 - w_1
    loss[masks.sum(0) == 1] = w_1
    loss[masks.sum(0) == 0] = w_0
    ZZ = xBLoss + loss
    return ZZ
  Click me to hide the output

Here is an example of how the weight map affects a set of masks. We generate three circular regions such that two of them are much closer to each other than the third one. The weight map functions magnifies the values in the region that is close to boundaries more than other values.

In [9]:
params = [(20, 16, 10), (44, 16, 10), (47, 47, 10)]
masks = np.zeros((3, 64, 64))
for i, (cx, cy, radius) in enumerate(params):
    rr, cc = draw.circle(cx, cy, radius)
    masks[i, rr, cc] = 1
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))
ax1.set_title('True Masks', fontsize=15)

weights = make_weight_map(masks)
pos = ax2.imshow(weights)
ax2.set_title('Weights', fontsize=15)
_ = fig.colorbar(pos, ax=ax2)
  Click me to hide the output

Application of these weight maps to the output of neural network and then finally computing the compunded loss is somewhat involved. A fully convolutional network like the UNet - since it has no dense layers at the end - needs 1 x 1 convolutions at the end to convert the convolutions from the previous layer to produce something on which categorical softmax can work. The output layer produces a volumne of size $h \times w \times K$, where $h$ and $w$ are the image height and width respectively and $K$ is the number of classes. Denoting the unactivated output of the channel correspoding to the $k$th class as $a_{k}(\mathbf{x})$, the softmax activation is computed as

$$ p_{k}(\mathbf{x}) = \cfrac{exp(a_{k}(\mathbf{x}))}{\sum_{k'=1}^{K}exp(a_{k'}(\mathbf{x}))} $$

Then, a modified cross entropy loss is calculated as
$$E = \sum w(\mathbf{x})log(p_{l(\mathbf{x)}}(\mathbf{x}))$$

where $w(\mathbf{x})$ is the weight map function and $l(\mathbf{x})$ denotes the true label of each pixel, Thus, computing $E$ amounts to:

  1. computing the softmax activation
  2. using the true labels as a mask for this activation, (note how the function $l(\mathbf{x})$ is used),
  3. summing the masked output along the dimension corresponding to the $K$ and taking it's log
  4. multiply the log output with the weight map and aggregate the result across pixels

When using Keras with a Tensorflow backend, the crossentropy loss, by default, is a manual computation of cross entropy, which doesn't allow for weighing the loss explicitly. The manual computation is necessary because the corresponding Tensorflow loss expects logits, whereas Keras losses expect probabilities.

Besides, it is clear that the loss we need isn't the usual categorical crossentropy. Notice that the weights have to be applied to the log of the activation before the pixelwise sum is evaluated. My solution to this was to incorporate the weighing of the log of the activation into the Keras model itself, and leave only the aggregation to the loss function. Thus, the model ends up doing more than just generating the final set of softmax activations. It also takes the log of the activations and applies the weights to it. Therefore, the weight maps too become a part of the computational graph of the model - unlike in the case of conventional class weights, where it can be supplied at the time of loss calculation. So now the model takes three inputs - the images, their labels and their weight maps.

Thanks to Keras' beautiful functional API, all of this amounts to adding a few non-trainable layers to the model and writing a custom loss function to mimic only the aggregation of the categorical crossentropy function.

The entire script for the model is available here, but the essence of it is as follows:

from keras.layers import Input, Conv2D, Lambda, multiply # etc
# Start making the UNET model as usual

image_input = Input(shape=(image_height, image_width, n_channels))
conv1 = Conv2D(64, 3)(image_input)

# add other layers

# Final 1 x 1 trainable conv layer that does the softmax.
softmax_op = Conv2D(n_classes, 1, activation='softmax')(previous_layer)

# Add a few non trainable layers to mimic the computation of the crossentropy
# loss, so that the actual loss function just has to peform the
# aggregation.
normalize_activation = Lambda(lambda x: x / tf.reduce_sum(x, len(x.get_shape()) - 1, True))(softmax_op)
clip_activation = Lambda(lambda x: tf.clip_by_value(x, _epsilon, 1. - _epsilon))(normalize_activation)
log_activation = Lambda(lambda x: K.log(x))(clip_activation)

# Add a new input to serve as the source for the weight maps
weight_map_ip = Input(shape=(image_height, image_width))
weighted_softmax = multiply([log_activation, weight_map_ip])

model = Model(inputs=[image_input, weight_map_ip], outputs=[weighted_softmax])

This model can now be compiled and trained as usual. Note that the three Lambda layers are identical to the operations in the keras.backend.tensorflow_backend.catgorical_crossentropy. So essentially we have swallowed most of the functionality of the cross entropy loss into the model itself. I should re-emphasize that this had to be done because the weighted loss we wanted could not have been possible with the default loss function, since the scaling of the log of the activations (with the precomputed weights) has to be done before the loss is aggregated.

I discovered the UNet paper a few months ago during the Data Science Bowl 2018 competition on Kaggle. I spent a better part of three weeks on implementing the weight maps and incorporating them into my UNet. And it paid off fairly well - I jumped 504 places on the public leaderboard. In hindsight, data augmentation would have had the same (or perhaps a better) effect, but that fact that I was able to all of this shows how beautifully Keras' (and indeed, Tensorflow's) functional API has been designed.


  1. Olaf Ronneberger, Philipp Fischer: “U-Net: Convolutional Networks for Biomedical Image Segmentation”, 2015; arXiv:1505.04597.

Playing With the Konmari Method

I heard about the bestseller The Life-Changing Magic of Tidying Up at a SciPy talk about deculttering your data science project. The speakers admitted they hadn't read it - they were simply trying to point out that tidying up your space and tidying up your software project are both similar.

Read more…

On The Linearity Of Bayesian Classifiers

In his book, Neural Networks - A Comprehensive Foundation, Simon Haykin has an entire section (3.10) dedicated to how perceptrons and Bayesian classifiers are closely related when operating in a Gaussian environment. However, it is not until the end of the section that Haykin mentions that the relation is only limited to linearity. What is interesting about this is that a Perceptron can produce the same classification "model" as a Bayesian classifier, provided that the underlying data is drawn from a Gaussian distribution. This post is an experimental verification of that.

Read more…

A Geometric Proof of the Perceptron Convergence Theorem

The last time I studied neural networks in detail was five years ago in college. I did touch upon backpropagation when Andrew Ng's machine learning MOOC was offered on Coursera for the first time, but beyond that I've only dabbled with them through keras. Then recently, when I read about Coursera's imminent decision to pull down much of their freely available material (you can read a rant about it here), I went on a downloading spree (many thanks to the wonderful coursera-dl). Of all the courses I downloaded, the one that caught my eye was Geoffrey Hinton's course on Neural Networks for Machine Learning. Because of that and the fact that there were some computer vision projects going on at work, I decided to dive right in.

Hinton's course is wonderful. He is funny, and unsurprisingly, very very insightful about the core concepts in neural networks. One of the signs of this is the fact that this course is not at all cluttered with too much mathematics, and can be traveresed by someone with only a working knowledge of calculus. One of his most insightful moments in the course is when he describes the Perceptron learning rule as simply as follows:

  • If the perceptron makes no mistake, leave it alone.
  • If it predicts a false negative, add the input vector to the weight vector
  • If it predicts a false positive, subtract the input vector from the weight vector

This is so simple, that a literal implementation of this can make train a perceptron reasonably well (as we shall see). There are of course, numerous heuristics required when applying it in production, but the training algorithm is just this simple. Now, the popularity of the perceptron is because it guarantees linear convergence, i.e. if a binary classification problem is linearly separable in the feature space, the perceptron will always eventually learn to correctly classify the input samples. An algebraic or analytical proof of this can be found anywhere, but relies almost always on the Cauchy-Schwarz inequality. I thought that since the learning rule is so simple, then there must be a way to understand the convergence theorem using nothing more than the learning rule itself, and some simple data visualization. I think I've found a reasonable explanation, which is what this post is broadly about. But first, let's see a simple demonstration of training a perceptron.

Read more…

Understanding Allen Downey's Solution to the M&M Problem

Allen Downey makes a very good case for learning advanced mathematics through programming (Check the first section of the preface of Think Bayes, titled "My theory, which is mine"). But before the reader can hit paydirt with using the Bayes theorem in programming, Downey makes you go through some elementary problems in probability, which have to be solved by hand first, if you expect to have a clear enough understanding of the concept. I can vouch for this way of learning complex concepts. The way I learnt the backpropagation algorithm (and its derivation), was with a pen, paper and a calculator.

Read more…

Evangelism in FOSS

One of the most dangerous things that can affect any FOSS community is the tendency of evangelism for the sake of evangelism. Promoting the Python stack, expanding the userbase, etc, should come only as a consequence of the content we produce as developers. If evangelism even remotely becomes one of your goals, your quality is sure to suffer. And it's not just the empirical evidence that prompts me to say this. It even makes logical sense. If we want to "promote" Python and related tech, our best market would be the young and unexperienced (and therefore non-opinionated) minds. But note that such audiences are also very fickle. They may not return for the next conference. And since they don't, we have to count on more fresh entries each year. And in the conference itself, since we're all acutely aware of the demographic, we spend too many talks pandering to this part of the audience.

Actually, I'll even go so far as to say that expanding the user base of a language is not the purpose of a PyCon. For that we've got activities going on all year. Come on! It's a three day event that happens once a year! If you're so concerned about evangelism, focus on the local chapter meetups. PyCon isn't the place to do it. It's a place for people to get together and exchange ideas. Teaching basics gets in the way of that like nothing else.

So, in short, focus on the quality. Users will follow.

Organizing a Bookshelf with Multivariate Analysis

I have recently moved from Pune to Delhi. I had spent only a year in Pune, having moved there from Mumbai, where I lived for three years. Whenever I move, the bulk of my luggage consists of books and clothes. My stay in Pune was just a transition, so I never bothered to unpack and store all my stuff too carefully. Thus, a corner of my bedroom in my Pune apartment always looked like this:

In [1]:
from IPython.display import Image
Image('images/before.jpg', height=300, width=400)
  Click me to hide the output

The actual volume of books that I carried from Pune to Delhi is about twice of what is seen in the picture. My house in Delhi has six bookshelves, three in the living room and three in another room that I've made into a study. Naturally, I wanted to arrange my nearly 200 books in a manner that would be convenient to access, and such that the arrangement made some thematic sense. For example, I should not have to go to the living room to get one of my books on data analysis, and my partner shouldn't have to keep coming into the study to get one of her novels. Also, books that deal with similar subjects should naturally lie together. Obviously, arraging books semantically and such that they look good on the shelves isn't a data analysis problem as such, and even a haphazard, arbitrary arrangement of books would still work well for most practical purposes, but I still thought making it data driven would be a fun exercise.

Read more…

Limitations of the Fourier Transform: Need for a Data Driven Approach

My implementation of the Hilbert Huang transform (PyHHT) is quite close to a beta release. After nearly three years of inactivity, I've found some time to develop the PyHHT library in the last few weeks. In all this time many people have written to me about a lot of things - from the inability to use the module because of the lack of documentation, to comparison between the results of HHT and conventional time series analysis techniques. And now the time has come when I can no longer avoid writing the documentation. But as I write the documentation, I've found that I need to re-learn the concepts which I learned back when I started writing this module. It is out of this learning cycle that came my last blog post.

This blog post is the second in a series of posts in which I will discuss HHT in detail. (After all I've always thought that blogging is a lot more fun than writing documentation. This really doesn't bode well for my software development habits, but here I am.)

Read more…

Understanding Edge Effects in Empirical Mode Decomposition

It has been a little over three years since I started working on a Python implementation of the Hilbert Huang Transform. When I first presented it at SciPy India 2011 (video) it was just a collection of small scripts, without packaging, testing or even docstrings. Over time PyHHT has garnered some interest, and I have, since the last few weeks, found the time to regularly work on it. I have been able to come up with a decent implementation of empirical mode decomposition, pretty much in line with everything described in the original paper by Huang et al. The other parts of HHT, like the Hilbert spectrum and its time frequency representations are right around the corner.

Read more…