Computer Vision L6 -- Object Tracking
How do we track objects over time?

For example, the more saturated the green on the car, the faster it is moving toward the lower left (the reference is the color map on the right).
However, our perception of motion is often an illusion. For example:



If someone removed the color from the right frame, could we still tell what color the balloon would be? Yes, it would still be red. We know this because we know it is the same object, the same track: there is a correspondence between the balloon pixels on the left and the balloon pixels on the right.
It turns out that this basic idea gives us a signal we can turn into a machine learning problem, where a model has to fill in the colors for a video, and by doing that, it learns to track automatically. The mechanism for tracking emerges inside the model.
We basically make the model watch a bunch of videos where it gets to see one frame with color while the rest of the frames are converted to grayscale. We then train the model to predict all of the colors for the grayscale frames. What you can then show is that, internally, it learns to track things automatically.
We can just take arbitrary videos as input, because there is a natural coherence to colors over time: when you are wearing a shirt in a video, the color of the shirt normally cannot suddenly change.

We use a special color encoding, the Lab color space, which separates color from lightness. Channels a and b carry the color information, while L is the lightness. We don't need to care about the details of how the conversion works.


Once we know about the Lab color space, it is easy to set up a neural net for colorization: the input is the L channel, and the output is just the a and b values for each pixel.
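A minimal sketch of this regression setup, with a random array standing in for a real Lab image (a real pipeline would convert RGB to Lab first; shapes and values here are illustrative):

```python
import numpy as np

# A colorization "dataset" item: the Lab image split into input and target.
H, W = 4, 4
lab = np.random.default_rng(2).uniform(-1, 1, size=(H, W, 3))
L_in   = lab[..., :1]   # network input: lightness only, shape (H, W, 1)
ab_out = lab[..., 1:]   # regression target: color channels, shape (H, W, 2)

# With the regression formulation, the model minimizes per-pixel L2 error:
def l2_loss(pred_ab, true_ab):
    return np.mean((pred_ab - true_ab) ** 2)

print(l2_loss(np.zeros_like(ab_out), ab_out))
```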

Why does our output look less colorful than the ground-truth image?
Because it is the result of averaging. There are many possible colors for the bird, so what we output is the average of all plausible (a, b) values:

(the bird could be red or blue, but the average point lands on a muddy brown)
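The averaging effect can be checked numerically. The two ab points below are made up for illustration, not measured from any real image; the point is that the mean of two saturated colors has much lower chroma than either one:

```python
import numpy as np

# Hypothetical ab values (Lab color space) for two plausible bird colors.
ab_red  = np.array([ 60.0,  50.0])   # a strongly positive (reddish)
ab_blue = np.array([-10.0, -45.0])   # b strongly negative (bluish)

# An L2-trained regressor that sees both outcomes equally often
# converges to their mean, a desaturated in-between color.
ab_mean = (ab_red + ab_blue) / 2
print(ab_mean)                        # [25.   2.5]

# Chroma (distance from the gray axis) measures colorfulness:
def chroma(ab):
    return np.hypot(ab[0], ab[1])

print(chroma(ab_red), chroma(ab_blue), chroma(ab_mean))
```

The averaged color sits near the gray axis, which is exactly the washed-out brown seen in the regression output.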
One way around this is: rather than treating it as a regression problem (regressing the values of a and b), we can turn it into a classification problem and estimate a distribution over colors. To do so, we split the ab space into discrete bins.


What we're doing is, for each pixel, classifying which discrete color bin that pixel belongs to.
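The binning itself can be sketched as follows. The grid size and the ab range here are assumptions chosen for illustration, not the values used in any particular paper:

```python
import numpy as np

# Quantize the ab plane into a regular grid so that colorization
# becomes per-pixel classification over the grid cells.
AB_MIN, AB_MAX, BIN = -110.0, 110.0, 10.0
n = int((AB_MAX - AB_MIN) / BIN)          # bins per axis

def ab_to_bin(a, b):
    """Map a continuous (a, b) pair to a single class index."""
    ia = int(np.clip((a - AB_MIN) // BIN, 0, n - 1))
    ib = int(np.clip((b - AB_MIN) // BIN, 0, n - 1))
    return ia * n + ib

def bin_to_ab(idx):
    """Return the bin center, used when decoding a predicted class."""
    ia, ib = divmod(idx, n)
    return (AB_MIN + (ia + 0.5) * BIN, AB_MIN + (ib + 0.5) * BIN)

idx = ab_to_bin(60.0, 50.0)
print(idx, bin_to_ab(idx))
```

Training then uses a cross-entropy loss over these class indices, so the model can express multi-modal color distributions instead of collapsing to the mean.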

The prediction is rarely exactly correct, but this is an inverse problem: there is no unique correct solution.
 
The funny thing here is that it paints red around the dog's mouth, as if the tongue were sticking out. It shows that the model picks up biases from the dataset.


In the video problem, instead of asking what color the shirt is, we actually want to know: where should we copy the color from?
The solution is tracking, i.e. learning correspondence.

We try to learn a pointer into the reference image, so that when we copy and paste the color from the corresponding place back, we get the correct color.

i and j represent locations in the image, and f is the embedding vector at a location. For each pair (f_i, f_j) we can compute a similarity and collect the results into a similarity matrix A. We are going to learn these embeddings so that this correspondence emerges.
One way to instantiate the A matrix is with a softmax:

A_ij = exp(f_i · f_j) / Σ_k exp(f_k · f_j)

so that, for each target location j, the similarity weights over all reference locations sum to 1 (each row of this matrix sums to 1).
The prediction for the color at location j, written y_j, is the sum of all the colors in the reference color map, weighted by the similarity matrix A_ij:

y_j = Σ_i A_ij c_i

Now we can write down the optimization problem, which basically says: find the parameters of the neural network f such that, when we compute the similarities between f_i and f_j with matrix A, the weighted sum recovers the correct color (c_j is the ground truth, y_j is the prediction):

min_θ Σ_j L(y_j, c_j)

(we're training a neural net such that the similarity function A is able to copy and paste the color)
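The similarity matrix and the soft copy-and-paste step can be sketched in a few lines. Random vectors stand in for the CNN embeddings here; sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy embeddings: N reference pixels and M target pixels, D dims each.
# In the real model, f comes from a CNN over the frame.
N, M, D = 5, 4, 8
f_ref = rng.normal(size=(N, D))     # f_i: reference (color) frame
f_tgt = rng.normal(size=(M, D))     # f_j: target (grayscale) frame
c_ref = rng.normal(size=(N, 2))     # c_i: ab colors at reference pixels

# Similarity matrix: row j holds the softmax over reference locations i
# of the dot products f_j . f_i, so each row sums to 1.
A = softmax(f_tgt @ f_ref.T, axis=1)          # shape (M, N)

# Soft copy-and-paste of color: y_j = sum_i A[j, i] * c_i
y = A @ c_ref                                  # shape (M, 2)
print(y.shape)
```

Training would then backpropagate a loss between y and the true target colors into the embedding network, so that good correspondences are the ones that copy colors correctly.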
- How far apart in time can the reference frame and the input frame be? This is a hyper-parameter you can pick; basically, you want them to be close, but not too close.
 
In all, we want to find the arrows, i.e. the correspondences: e.g. we want to be sure that a flower on the dress at f_j in the input image is the same flower on the dress at f_i in the reference image.

And even though it's trained for the colorization task, we can repurpose the full model to propagate a mask of an object. The picture above shows a mask outlining where the person is. Basically, we reuse the same equation we saw before; we just change the meaning of the c vector: it no longer means color, it's a segmentation mask.
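Concretely, the same learned similarity matrix is kept, and c becomes a one-hot label per reference pixel. The labels below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

N, M, D = 6, 4, 8
f_ref = rng.normal(size=(N, D))
f_tgt = rng.normal(size=(M, D))

# Same similarity matrix as for colorization (rows sum to 1)...
A = softmax(f_tgt @ f_ref.T, axis=1)       # shape (M, N)

# ...but c is now a one-hot segmentation label per reference pixel
# (background = 0, person = 1; these labels are hypothetical).
labels = np.array([0, 1, 1, 0, 1, 0])
c_mask = np.eye(2)[labels]                  # shape (N, 2), one-hot rows

probs = A @ c_mask                          # soft mask per target pixel
pred = probs.argmax(axis=1)                 # hard mask via argmax
print(pred)
```

Because the rows of A sum to 1 and the rows of c_mask are one-hot, each row of `probs` is a proper distribution over labels, so the propagated mask can be read off directly.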


