DINOv2
all right
welcome to another hoopo stream
uh I did a little Switcheroo yesterday
so originally we were going to read a
segmentation paper kind of a
continuation of the segment anything
but uh just last night or I don't know
if it was last night but yesterday they
released
uh meta AI research
released Dino V2
uh which seems more important than uh
the paper I was going to read so I
switched and we're now going to read
DINOv2
uh
released basically uh
this week I don't know why it's S14
because I saw it for the first time on
uh Twitter yesterday
recommended by a friend
so this is version two of uh dino which
is
an unsupervised method for training kind
of
foundational computer vision models if
you can call it that I think computer
vision is getting to the point where uh
these things can be called foundational
models right we're no longer in the
world of segmentation models and
classification models and bounding box
detection models and I think that world
of having specialized architectures for
the different parts is slowly
disappearing into the mist of time and
more and more we're seeing kind of these
big giant
uh multi-task
computer vision models that you can
apply to
a huge variety of tasks and this is the
GitHub by the way and if you look here
right the backbones that they train are
all Vision Transformers so
we're kind of seeing the supremacy of
vision Transformers right you don't
really see a lot of ConvNets as
encoders anymore
but these are Big right you got big
big Vision Transformers that are
supposed to be task agnostic
so look at this you can even load them
directly from python this is quite
quite good
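As a rough sketch of what that looks like (the model name here is taken from their README; treat the exact call as illustrative):

```python
import torch

# Load one of the released DINOv2 backbones from the repo via torch.hub
# (a smaller variant like ViT-S/14 fits comfortably on a consumer GPU).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# 224 is divisible by the patch size 14, so this gives a 16x16 grid of patches.
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = model(dummy)   # one global feature vector per image
print(features.shape)
```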
I got the conda environment
and the requirements
yeah and the problem maybe the one
negative of this is if you look at this
right these like training uh runs that
they run they're running on 12 a100s
like this is
you know like I feel like the previous
generation of computer vision models you
could kind of maybe train them at home
or at least maybe fine-tune them at home
but nowadays these things are just so
huge you can't even
run these you know
but it's good you know we're kind of
moving forward and we're seeing the
beginnings of foundational models for
computer vision
they're not the beginnings but kind of
the supremacy of them
so let's get going here so
as we see these Foundation models
become the norm in the machine learning
world this is something that you're
seeing all the time as well right
gone are the days where a paper just has
three to four names on it right nowadays
the machine learning papers have 20
names on them because
there's a whole Squad of people a whole
team of people that are required to
train these giant Foundation models so
maybe we're going to see a change in
Just the Way Machine learning research
works and rather than reading uh 10
papers that are each kind of like
largely spearheaded by one person and
there's maybe like a handful of people
on the thing
you're going to be reading machine
learning papers uh with 20 names on them
right which is kind of what we're seeing
um so let's get started here recent
breakthroughs in natural language
processing on model pre-training on
large quantities of data have opened the way for
similar Foundation models in computer
vision yeah that's kind of the key word
there
these models could greatly simplify the
use of images in any system by producing
all-purpose visual features right you
want to have an encoder that
give it an image and it'll give you a
feature Vector an embedding that is
useful for any task you could want
segmentation classification bounding box
detection potentially some kind of weird
regression
features that work across image
distributions and tasks without fine
tuning
yeah you don't want to you don't need to
push any gradients into this giant
encoder
this work shows that existing
pre-training methods especially
self-supervised methods can produce such
features if trained on enough curated
data from diverse sources so curated
data from diverse sources right
the word curated there is a little bit
interesting right that means they're
doing a lot of data cleaning a lot of
data prep and this is something we saw
out of open AI as well right where they
actually have a whole team of people
internally that clean the text Data
right so if cleaning the text data is
important I think cleaning the image
data is even more important
because the distribution of kind of
image
data is
I don't I don't know if I want to say
broader and more varied than text data
but I would I would kind of make that
statement
uh we revisit existing approaches
combine different techniques and scale
our pre-training in terms of data and
model size
so maybe one interesting thing here is
that openai is not going to tell us
about all the different tricks and
techniques that they use for
pre-training and training their large
Foundation models because they're so
afraid of competitors
that they don't want to release that but
meta Is Not Afraid right and
whatever techniques they use here uh for
training these large models on huge uh
these huge parallel setups of like
multi-gpus in these server racks
those are probably very very similar to
the techniques that openai is using to
train their llms right there's probably
the same kind of tricks so
looking at these type of papers where
they're trading these Vision Foundation
models might be a way to kind of Intuit
what openai is doing when they're
training their uh text Foundation models
so a little tip there
uh technical contributions aim at
accelerating and stabilizing the
training at scale right lots of
different types of regularization maybe
uh
evaluation kind of like weaved in there
we propose an automatic pipeline to
build a dedicated diverse and curated
image data set instead of uncurated data
as typically done in a self-supervised
literature
curation is important
in terms of the models we train a vit
model with 1 billion parameters this is
a huge Vision Transformer
this is no joke like you probably can't
even fit this on your consumer GPU and
distill it into a series of smaller
models right
so this is kind of interesting I feel
like
maybe previously what you would have
seen is that they would have trained
multiple smaller or multiple models of
different sizes but now
kind of distillation has gotten to the
point where it's pretty good and there's
a good set of Tricks associated with
model distillation model distillation is
where you take a larger model and then
you train a smaller model to basically
mimic the bigger model right you give
something to the bigger model the bigger
model produces the output and then you
say Okay small model here is the input
to the big model and the output to the
big model just copy that right
so if you're trying to save on a bunch
of training and compute budget this
actually seems like a very good
technique right train the very very big
model with your huge data set with all
your tricks all the like kind of
regularization all that stuff and then
take that big model and then from it
distill a bunch of smaller models
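A minimal sketch of that idea, not their actual recipe (their distillation reuses the full self-supervised objective), just a student copying a frozen teacher's outputs with a soft cross-entropy:

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, images, optimizer, temperature=1.0):
    """One toy distillation step: the student mimics the frozen teacher's outputs."""
    with torch.no_grad():                      # teacher is frozen, no gradients
        teacher_logits = teacher(images)
    student_logits = student(images)
    # Soft-target cross-entropy (KL divergence) between teacher and student.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```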
which is actually probably
much more common that we think it is I
think probably open AI does this as well
right I bet you that whenever you're
using ChatGPT right you're not actually
using the large chat GPT model you're
using some kind of distilled model that
has been uh kind of chosen to fit
exactly into whatever one inference GPU
that like
is designed for user requests right
because the one billion parameter model
is just going to be too huge
having that in a kind of
model serving context where people can
send requests to that model it would
be very complicated to have
that available
but these little tiny models that you
can fit on one GPU right that's a lot
easier to do
OpenCLIP
uh-oh is this the end of clip no because
this isn't text right
learning task agnostic pre-trained
representations have become the standard
in natural language processing
one can use these features as they are
IE without fine tuning
this is key here right
and Achieve performance on Downstream
tasks that are significantly better than
those produced by task specific models
the success has been fueled by
pre-training on large quantities of raw
text using pretext objectives
such as language modeling or word
vectors that require no supervision
and the word raw here is kind of
misleading right I think part of what
they're going to talk about in this
paper and I think it's going to be a
huge theme is the curation right where
you can just train on huge raw data sets
that you just scraped off the internet
but you the quality of that of those
images is just not going to be good
enough and you you want to curate that
data set so
I suspect that a huge section of this is
going to be the curation
uh following this Paradigm Shift we
expect similar Foundation models
to appear in computer vision
these models should generate visual
features that
work out of the box on any task
most promising efforts towards these
foundational models focus on text guided
pre-training
so this is what clip does right clip has
both text and image features are being
projected into the same kind of
embedding space
but it doesn't seem like that's what
they're going to be doing here I think
this is an image only model
this form of text guided pre-training
limits the information that can be
retained about the image since captions
only approximate the rich information
in images and complex pixel level
information
may not surface with this supervision
this is kind of interesting they're
saying that there's some limit
to uh text and image modalities if you
train on both text and image which is
what clip does they're saying that
you're going to lose out on some of the
signal that you're going to get right
specifically this pixel level
information
that's kind of cool right I feel like
that's also contrarian right most people
would say that hey if you have text and
image it's better to train on both
modalities and the features you get out
of that are going to be more rich but
here you have people saying that no the
text is just a distraction you're better
off just training purely on the image
we compute a PCA PCA is a principal
component analysis it's a kind of it's a
form of dimensionality reduction uh I
would call it like a classic machine
learning algorithm and it's a way to
take kind of like a high dimensional uh
vector
space I guess and compress it into
usually three dimensions so that you can
visualize it so
uh
here we go
so if you have some data set like this
right you can identify the principal
component right and the principal
components are going to be defined by
the some kind of dimension in which you
have High variance or low variance right
so here
this kind of smear of data right you can
say okay well the most variants exist
kind of along this axis and then this
axis so these two must be the the two
principal components of it right and you
can you're not limited to like two
Dimensions or three dimensions you can
kind of do it in any amount of
dimensions
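A toy version of what that figure is doing, with random numbers standing in for the per-patch features coming out of the encoder:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a (num_patches, feature_dim) array of patch features.
patch_features = np.random.randn(16 * 16, 768)

pca = PCA(n_components=3)
components = pca.fit_transform(patch_features)     # (num_patches, 3)

# Rescale each component to [0, 1] so it can be shown as a color channel.
components -= components.min(axis=0)
components /= components.max(axis=0)
rgb_image = components.reshape(16, 16, 3)           # one color per patch
```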
show the first three components each
component is matched to a different
color Channel
same parts are matched between related
images despite changes of pose style or
even objects
huh
so this is actually really cool so when
you look at this it kind of looks like
some kind of segmentation right like a
pose detector model right like these uh
pose Nets dense pose
right this is like work that people did
for a while where it's like it's
basically
segments out the head and the body and
the legs as different parts but the
interesting thing here is that this
isn't that right this is just PCA on the
features right you they fed these images
into their Giant image encoder they get
a vector of features they do PCA on
those features and then it turns out
that the representation of the elephant
head is
the same kind of thing as the Eagles
Wings right and I guess not here here
they're showing you that all four of
these elephants right which one of them
isn't even a picture of an elephant it's
like a picture of a statue of an
elephant the head of the elephant
is the same right you see this green
color on all of it so it's almost like
it's implicitly learned this notion of
different parts of the animal which is
kind of crazy actually look at
here this one's even more impressive
right look at this there's this is like
an overhead shot
of a bunch of horses on a field
and each of them is
has the same exact kind of uh coloring
as the individual pictures of horses
which is crazy right because these are
so tiny
so we can understand scale quite well as
well
that's cool
an alternative to text guided
pre-training is self-supervised learning
where features are learned from images
alone these approaches are conceptually
closer to pretext tasks such as language
modeling and can capture information at
the image and pixel level
however despite their potential to learn
all-purpose features most of these
advances in self-supervised learning
were made in the context of pre-training
on small curated data sets ImageNet-1k
right good old ImageNet-1k
right the 1k there refers to 1,000
classes it's a classification data set
some efforts on scaling these approaches
have been attempted but they focused on
uncurated data sets which typically lead
to a significant drop in quality
right uncurated significant drop in
quality
this is explained by the lack of control
over the data quality and diversity
which are essential to produce good
features
I agree with this but I also think that
this is a phase right I think that we're
currently in a phase where training on a
very very large curated data set is
better than training on a slightly
larger uncurated data set but I feel
like the
the Rich Sutton kind of bitter lesson is
going to come back and what we're going
to see is that
I feel like in the future the most the
biggest possible uncurated data set is
actually going to be the best way to do
it so I think this curating the data
sets is working right now for this
particular uh generation of foundation
models but I suspect that in the future
the scale will just beat out
and maybe there are some ways that you
could basically weight the data you could
take the image and then figure out
whether it's a high quality image or a
low quality image with a model itself
and then basically do some kind of like
pseudo-labeling in that way
we explore if self-supervised learning
has the potential to learn all-purpose
visual features if pre-trained on a large
quantity of curated data we revisit existing
discriminative self-supervised
approaches
okay
that learn features about the image and
Patch level both the image and Patch
level so these are Vision Transformers
right so they cut up the image into
these little patches
we reconsider some of the design
choices under the lens of a larger data
set most of our technical contributions
are tailored towards stabilizing and
accelerating discriminative
self-supervised learning
so basically it's all these bags of
tricks that you use whenever you're
training these huge models
so here are some numbers here two times
faster
and three times less memory
which is huge
and larger batch sizes this is key so
uh very old paper at this point but a
paper that is definitely a seminal work
in machine learning is "Don't Decay the
Learning Rate, Increase the
Batch Size"
and basically what they describe in this
paper is that
larger batch sizes are just inherently
better right because the bigger the
batch size the more kind of stable the
gradient or the kind of direction is
right like if you just have a couple
images if you have a small batch size
then the direction you take in that
gradient step that results from that
batch can be kind of noisy right and you
can end up taking kind of steps that
kind of move around for no like there's
a lot of noise in them right but if you
have a very large batch the average kind
of
the direction that you end up getting
when you take a gradient step from that
large batch is more in line with the
entire data set right so you kind of end
up taking it like a straighter line
through this loss landscape
regarding pre-training data we have
built an automatic pipeline to filter
and rebalance data sets
so this is similar to what we saw in the
segment anything paper where basically
they have this human in the loop kind of
like semi-automatic
pipeline where
kind of like initially the humans are
labeling and then the system labels more
and more and then the humans are just
kind of like uh confirming and uh
checking to make sure that the system is
labeling correctly
data similarities are used instead of
external metadata and do not
require manual annotation
a major difficulty when dealing with
images is to rebalance Concepts and
avoid overfitting on a few dominant
modes
okay interesting so you have this kind
of mode collapse potentially where
there's specific kind of solutions that
the uh
neural net can end up in that are like
kind of good for solving the thing but
are just mostly at local Minima
or a local Maxima depending on what you
want to measure
we gathered a small but diverse Corpus
of 142 million images just just a little
small data set you know just a 142
million images just casual
we provide a variety of pre-trained
visual models called Dyno V2 trained
with different Vision Transformers
architectures on our data
we released all the models
you know open AI like they don't release
and then meta here is releasing
everything so
meta is more open than open AI which is
kind of weird to think about
we validate the quality of DINOv2 on
various computer vision benchmarks so of
course you're going to see some kind of
imagenet potentially Coco
we conclude that self-supervised
pre-training alone is a good candidate
for learning transferable Frozen
features so Frozen here and transferable
the idea here is that you don't want to
uh fine-tune the feature encoder
sometimes when you take a pre-trained
encoder such as an imagenet encoder you
don't actually you you don't want to
freeze it right you you want to let some
of your gradients flow through it
because like that it becomes more
adapted to your task but we're moving
into a world where these pre-trained
encoders are just so powerful that if
you push gradients into them they're
just going to become worse so in that
Paradigm you would basically never
freeze your feature encoder
right and the feature encoder is uh this
thing here right this giant
backbone is another word that you can
use to call it
that are competitive with the best
openly available weekly supervised
models
competitive so they didn't get state of
the art they got competitive
intra image self-supervised training
okay so here you have different related
works
this paper is actually pretty long and I
do have a heart out so I'm probably
going to scroll through some of this
overview of our data processing pipeline
images from a curated and uncurated data
source are first mapped to embeddings
uncurated images are then deduplicated
and matched to the curated images
the resulting combination augments the
initial data set through self-supervised
retrieval system
okay
so they have
uh in these kind of like block diagrams
a lot of times the data right the
database is like kind of like shown as a
cylinder
so here they're showing you they have
this giant database of
uncurated data and curated data and
obviously the size here shows you that
there's much more uncurated data than
there is curated data then they take all
of these images and they feed them
through some feature encoder which is
probably just going to be a version of
the feature encoder that they're using
and it'll give you an embedding right
and you can use the similarity between
those embeddings to compare the uh
different images right so the same way
that people use Vector databases to
store text in such a way that you can
use similarity to compare different
parts of text which is super hot in the
llm space right now you can do the same
kind of thing with images right you
could you could embed all your images
and then you can use similarity between
the embedded images in order to find
similar images
which is what's going on here right
deduplication is basically just saying
hey these two images have almost
identical embeddings therefore let's get
rid of it and retrieval here is the uh
idea of like okay let's go and give me
images that are very very similar in
their embeddings to this image here and
that's what you get there
so this kind of self-supervised
retrieval system where the system can go
into the database find the images that
are most similar to the one that it
currently has and then maybe curate its
own little mini batch that it trains on
growing body of work is focused on
scaling the abilities of self-supervised
pre-training and model size
automatic data curation
okay data processing we assemble our
curated
LVD-142M data set so this is the
actual data set that they use
and again like props to them for
actually naming it you know and like
make like telling you what they're
training on it's not just like a
hey we trained on a data set but we're
not going to even tell you what it is or
how big it is or anything like that
right they do tell you how big it is and
they do tell you kind of how they got it
uh images that are close to those in
several curated data sets we describe
below the main component in our data
pipeline including the curated Dash
uncurated data sources
the image deduplication step and the
retrieval system so there's a couple
different components of here you have
the curated and uncurated sources you have
deduplication and then you have
retrieval
does not require any metadata
or text so they're not training a clip
right they're not training something
that has knowledge of both text and
image this is a pure image
Foundation model
uh is detailed in the appendix and
contains ImageNet-22k the train split of
ImageNet-1k Google Landmarks and several
fine-grained data sets so this is the
actual
components of
their LVD-142M
we collect a raw unfiltered data set
from a publicly available repository of
crawled image data
okay
we extract URL links of images
we discard URLs that are unsafe or
restricted by domains and post-process
the downloaded images PCA hash
deduplication
NSFW filtering and blurring identifiable
faces
I wonder how much they're missing out on
you know like in these type of
like how much of all the images in the
internet are like not safe for work
right
it's probably a lot there's probably a
huge chunk of
of image data that they could be training
on
right so who's going to be brave enough
to train on all the porn data
that's the real question
we apply the copy detection pipeline of
Pizzi et al. to the uncurated data
and remove near-duplicate images
so this is where they're doing the
similarity of the embeddings this
reduces redundancy and increases
diversity among images
we also remove near duplicates of images
contained in the test or validation set
of any Benchmark used in this work
a lot of filtering going on here
we build our curated
pre-training data set by retrieving
images from our uncurated data source
that are close to the images in our
curated sources
we first compute an image embedding
using a self-supervised ViT-H/16 network
pre-trained on imagenet 22k
and then use cosine similarity as a
distance measure between these two
images so
this is interesting I thought that they
would have basically the way that they
embedded these images they would have
used the model itself and then kind of
constantly updated that model to get
better and better embeddings
but it actually sounds like what they're
using to create these embeddings for the
retrieval and deduplication is actually
just a pre-trained ViT-Huge so
the H here means huge so it's a
bigger one right there's different sizes
you have ViT-Small ViT-Base ViT-Large
and then ViT-Huge
and the 16 here refers to the patch size
this particular Vision Transformer cuts
the image into 16 by 16 pixel
patches
and then uh cosine similarity is a
measure of similarity between two
vectors right
so it tells you how similar two vectors
are so two vectors that are basically
like this very high cosine similarity
two vectors that are like that pointing
in opposite directions very low cosine
similarity
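Written out, for two embedding vectors $x$ and $y$:

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert} \in [-1, 1]$$

so 1 means they point the same way, 0 means they're orthogonal, and -1 means they point in opposite directions.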
uh given a query data set if it is large
enough we retrieve N typically four
nearest neighbors for each query image
if it is small we sample M images from
the cluster corresponding to each query
image we adjust n and M by visual
inspection of the retrieval result
they probably have an internal Vector
database right like this to me screams
internal Vector database
the deduplication and retrieval stages
of our pipeline rely on the FAISS library
Facebook
AI Similarity Search something like
that
to efficiently index and compute batch
searches of its nearest embeddings
oh I remember this yeah
yeah this is like the internal Facebook
similarity search
written in C++ with complete wrappers
for Python and NumPy someone needs to
rewrite this in Rust
every time I see C++ now I'm
like
they should rewrite it in Rust but I
feel like rust developers are like
extremely hard to find because it takes
a special type of person to learn rust
uh we heavily leverage its support for
GPU accelerated indices using inverted
file indices with product quantization
codes
yeah so these are all the different
tricks you can use to basically make
similarity search faster
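A minimal sketch of that kind of FAISS usage, using an exact inner-product index on L2-normalized embeddings (which is cosine similarity); the quantized GPU indices they mention are configured differently:

```python
import numpy as np
import faiss

d = 768                                   # embedding dimension
embeddings = np.random.rand(10000, d).astype("float32")
faiss.normalize_L2(embeddings)            # so inner product == cosine similarity

index = faiss.IndexFlatIP(d)              # exact inner-product index
index.add(embeddings)

# Retrieval: for each query embedding, find its 4 nearest neighbors.
queries = embeddings[:5]
scores, neighbors = index.search(queries, 4)

# Deduplication idea: a neighbor (other than the image itself) with similarity
# above some threshold gets treated as a near-duplicate.
near_duplicates = scores[:, 1] > 0.95
```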
the whole processing is distributed in a
compute cluster of 20 nodes equipped
with like look at these monsters here
you have
20
nodes right and each node has eight V100
32 gigabyte GPUs like
holy, these are monster
systems right
and
you know I stand in awe of these kind of
systems because they're very powerful
but also it makes me a little bit sad
right because
I feel like five years ago you read a
machine learning paper and and they were
like oh we trained this on like a GPU on
our consumer hardware and it was amazing
because you're like oh that means I
can train that right but I feel like
every single paper that I read now it's
like the hardware that they're using is
just Way Beyond
everything how much do they cost each
let's see
a V100 32
gigabyte
you know this isn't as bad as the A100s
this is like a $3,000 maybe four
thousand dollar GPU
but there's eight of them right so it's
three thousand dollars
times eight which is about 24 thousand dollars
per node and then there's 20 nodes
so the total cost of that training rig
is basically half a
million dollars
so this 20 node setup with eight V100 32GB GPUs each is a half
million dollar system
we learn our features with a
discriminative self-supervised method
that can be seen as a combination of
DINO and iBOT losses with the centering
of
SwAV so here are the different
losses right and the
loss for this is probably
going to be quite complicated there's
going to be like 10 different terms to
it
right they also have regularization here
uh a high resolution training phase We
rapidly introduce each of these
approaches but more details can be found
in the related papers okay so here are
all these different tricks let's see
Image level objective
we consider the cross entropy loss
between the features extracted from a
student and a teacher Network so in a
distillation framework you have a
teacher Network which is the big one and
then you have the student which is the
small Network
a small model the small neural net
both features are coming from the class
token of a vit
obtained from different crops of the
same image okay so they're cropping the
different parts of the image and then
feeding that into a vision encoder for a
vision Transformer which then gives you
visual tokens right and visual tokens is
just another way to say an embedding for
an image
we learn the parameters of the student
and build a teacher with an exponential
moving average of its past iterates
right so
this we saw yesterday right the
exponential moving average this idea of
keeping a running average of the
model's weights over its past training steps
so the teacher is basically a smoothed
copy of
the student
right this is uh something that we that
is much more of a thing in reinforcement
learning
because of the way that reinforcement
learning works and you have to kind of
spread your model in order to gather
experience but
the same kind of like uh distributed
training requirements are resulting in
EMA being more and more popular for
everything that isn't RL
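A minimal sketch of a DINO-style EMA teacher update (the momentum value is illustrative, not taken from the paper):

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.996):
    """EMA update: the teacher's weights drift slowly toward the student's."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```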
we randomly mask some of the input
patches given to the student but not to the
teacher we then
so it's kind of like a dropout
kind of thing right so the vision
Transformer cuts the image into
patches and you're going
to basically mask out some of them right
so it's kind of like dropout
basically
we add a cross entropy loss between the
patch features of both networks on each
masked patch this loss is combined with the
image level loss
untying head weights between both
objectives we observe that tying the
weights associated with both objectives
makes the model under fit at the patch
level while overfitting at the Image
level
untying these weights resolves this
issue and improves the performance at
both scales
Okay so
underfitting at the patch level versus
overfitting at the Image level so
overfitting at the Image level means
that at the highest kind of point of
your model right your model is this kind
of like there's layers that go all the
way from the layers that are close to
the image which the low level features
which are the patches and then you go up
and up and up and up all the way to the
classification head right the model head
right so you have the patch level
part of the model which is the bottom
and then you have the head part of the
model which is the top so what they're
saying is that you can actually have
under fitting at the patch level and
overfitting at the Image level which
means that your model head is kind of
over fit to the data right
it it has already memorized the data
because it's smaller the head is smaller
but the patches especially if you have a
huge Vision Transformer there's a lot
more model capacity in there so
they're actually under fit
so this is kind of an interesting uh
uh observation there where when you have
these giant models where the bottom is
just absolutely massive and then you
have like these tiny little
classification heads right that maybe
only have a thousand uh classes at the
top you can end up in a world where your
head is over fit and then the bottom
part of your model is underfit
uh Sinkhorn-Knopp centering they
recommend to replace the teacher softmax
centering step of DINO and iBOT by the
Sinkhorn-Knopp batch normalization
I don't
know what this is but it
probably seems like it's just some kind
of normalization batch normalization
layer norms and probably a
500 IQ combination of
normalization at specific layers right
KoLeo regularizer another regularizer
here
differential entropy estimation
encourages a uniform span of the
features within a batch okay so another
type of like
uh batch norm where normally what batch
norm is saying is that the actual values
for all the kind of activations in
between layers of for the batch should
be roughly the same right you don't want
uh if you have a batch of 10 images you
don't want one of the images to have
like super high uh activations and then
everything else in the batch is
basically just zero you want to like
kind of normalize them in such a way
that all of them have a little bit of
signal right so you can have a little
bit more information coming through
so
this is kind of the same idea
given a set of n vectors
you have the loss L_koleo so this is
a fancy script L it just means it's a
loss function right so you want this to
be lower
you have negative 1 over n and a
summation from i equals 1 to n so this
just means an average an average of the
log of d_{n,i} where d_{n,i} is the minimum over j
of the distance between x_i and x_j
so if you have n vectors X1 to xn right
and there's going to be
a batch is going to be a set of vectors
right if you have a batch of 10 images
you feed them through your image encoder
you're going to get a set of or a batch
of 10 vectors and then they're saying
okay
for each vector you take the minimum
distance between it and every
other vector in the batch the
distance here right that's what these
little double bars mean it's a norm the
length of the difference vector
then the log of that sum it over all the
batch get the average that's the loss so
it's like an extra regularization term
we also L2 normalize the features before
Computing this regularizer
okay so a lot of fancy uh regularization
and normalization going on here
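Here is a small PyTorch sketch of how I read that regularizer (my own reading, not their implementation):

```python
import torch
import torch.nn.functional as F

def koleo_loss(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KoLeo-style regularizer: encourage features to spread out within the batch."""
    # x: (n, d) batch of feature vectors
    x = F.normalize(x, dim=-1)                 # L2-normalize first, as they mention
    dists = torch.cdist(x, x)                  # pairwise Euclidean distances, (n, n)
    dists.fill_diagonal_(float("inf"))         # ignore each vector's distance to itself
    d_min = dists.min(dim=1).values            # nearest-neighbor distance per sample
    return -torch.log(d_min + eps).mean()      # -1/n * sum_i log d_{n,i}
```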
uh adapting the resolution increasing
image resolution is key to pixel level
Downstream tasks such as segmentation or
detection where small objects disappear
at low resolutions
yeah this is kind of interesting because
one thing that we saw right it's in this
paper here at the very beginning is uh
this picture here of these horses like
these horses are absolutely tiny
right like look at this picture it's
like an overhead image of like 50 horses
on a Green Field and it's picking out
the individual horses right so if you
were to like down sample all these
images to like a 256 by 256 or something
you would lose all of that right you
would no longer have
the ability to like kind of pick out
tiny things in large high resolution
images so how do they do that
so the way that they do that is that
they train at high resolution which is
time and memory demanding
and instead they increase the resolution
of images to 518 by 518 during a short
period at the end of pre-training
okay so they have like a schedule here
right
or a curriculum is another word for this
right where you have training on a
specific data set at the beginning and
then you train on a different data set
afterwards right you have a curriculum
that you that you use
and here the curriculum is low
resolution images and then high
resolution images
we consider several improvements to
train the models we train models on
A100s using PyTorch 2.0
that's pretty cool they're using kind of
the bleeding edge
what did I just do I just accidentally
went all the way down
uh
the code is available along with
pre-trained models used for feature
extraction
so the pre-trained model that I assume
they used to uh get the embeddings that
they use for the similarity search
that's probably what they mean by this
one
with the same Hardware compared to the
iBot implementation the dyno V2 code
runs around two times faster
and only one third of the memory
fast and memory efficient
attention yeah so
one of the biggest problems with
Transformers is because they basically
multiply every vector by every other
Vector like the length of your sequence
is actually determining the size of the
overall memory right so if you have very
small sequence your memory footprint is
going to be small but as soon as you
increase the the sequence length right
the memory grows quadratically with that
so
Transformers are very
memory hungry right there's some
techniques that people have come up with
to reduce the amount of memory that
Transformers take but it's still it
could still be pretty bad so let's see
what they uh do here we Implement our
own version of flash attention to
improve memory usage and speed on the
self-attention layers
our version is on par with or better
than the original on all cases
considered while covering more use cases
and Hardware
due to the GPU Hardware specifics the
efficiency is best when the embedding
Dimension per head is a multiple of 64.
yeah
so this is an important thing to
consider is that
sometimes people don't realize it but a
lot of the hyper parameters for the
model architecture are not even chosen
because they result in better
performance they're chosen because
they're specific to the hardware that
they're trained on right
so Google models are going to be
the hyper parameters for the model
architecture of a model that's trained
at Google is going to be specific to the
size of the TPU that it's being trained
on right
the
model parameter or the model hyper
parameters for the model architecture
that uh Facebook trains are going to be
specific to their a100 gpus so
these Dimensions right the size of the
model head the size of the uh
embeddings that you have within your
vision Transformer all of those are
going to be specific to the hardware
uh matrix operations are even better when the full
embedding dimension is a multiple of 256
as a consequence our vitg architecture
slightly differs from the architecture
proposed in order to maximize compute
efficiency
we use an embedding dimension of 1536
with 24 heads rather than 1408 with 16
heads
yeah so
bigger model
but also chosen so that works better
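The arithmetic there (assuming the alternative they mean is the standard 1408-dim, 16-head ViT-g):

$$\frac{1536}{24} = 64 \quad\text{vs.}\quad \frac{1408}{16} = 88, \qquad 1536 = 6 \times 256$$

so the per-head dimension is the multiple of 64 the hardware likes, and the full embedding dimension is a multiple of 256.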
and this is like a this paper is booby
trapped with
with uh
references so I can't click
fourteen thousand
our ViT-g backbone counts 1.1 billion
parameters which is quite big
nested tensors and self-attention our
version also allows running in the same
forward pass the global crop and the
local crop
that have different numbers of patch
tokens
leading to significant compute
efficiency gains compared to
using separate forward and backward
passes as done in prior implementations
Okay so
basically
how do I describe this
I don't know let's not describe this
efficient stochastic depths
we Implement an improved version of
stochastic depth that skips the
computation of the dropped residuals
rather than masking the result
so
in these Transformers sometimes you mask
specific parts and also they said that
they were dropping out specific patches
in the Transformer so
whenever you have Dropout in your model
in plain PyTorch a lot of times it's
actually still getting calculated and
then it's just getting zeroed out right
so you're actually spending some amount
of compute calculating uh activations
which are just going to get dropped out
so you could probably save by not having
to calculate those
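A rough sketch of plain stochastic depth that really skips the branch instead of computing it and zeroing it out; their version is a fancier batched variant that shuffles the batch and slices a fraction of the samples:

```python
import torch
import torch.nn as nn

class StochasticDepthResidual(nn.Module):
    """Residual block that, when dropped, never evaluates the expensive branch."""
    def __init__(self, branch: nn.Module, drop_rate: float = 0.4):
        super().__init__()
        self.branch = branch
        self.drop_rate = drop_rate

    def forward(self, x):
        # Toy version: drops the branch for the whole batch at once during training.
        if self.training and torch.rand(()).item() < self.drop_rate:
            return x                                   # branch is never computed
        out = self.branch(x)
        if self.training:
            out = out / (1.0 - self.drop_rate)         # rescale to keep expectation equal
        return x + out
```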
hope you're doing well it will be great
if you could try to do live
implementations from scratch by
referring to the architecture in the
paper
yeah I mean I can try to but I think
it's important to realize that the
implementations of papers
is kind of fading away right
it used to be
possible to implement uh machine
learning papers because they were
trained on similar hardware and they
were made by basically one or two people
and a six-month kind of research project
but
this is not that right this is a model
trained on million dollar systems uh
created by teams of 20 people so it's
basically impossible to re-implement
these right you're not going to
re-implement this paper
what you can do is you can take this uh
Vision Transformer and use it in your
own uh technique right you can you can
go and you can download this exact
Vision Transformer and you can use it
for some kind of cool new interesting
thing
or app or or task that you have I think
that's definitely doable you can do that
as a single person but as a single
person it's basically impossible to
re-implement this paper because you
don't
there's just not enough time right
you're not a 20-person team you're not
going to have a half million dollars to
spend on gpus
uh this saves memory and compute in
proportion approximately equal to the
drop rate thanks to specific fused
kernels
uh fused kernel so obviously whenever
you create a deep learning uh model
it gets compiled into these Cuda kernels
which are what actually is running on
your GPU and those Cuda kernels like
joining them together or fusing them as
it's called is
one of the best ways to get better
efficiency and uh
speed or use less compute basically so
that's another
requirement that is driving the model
architecture where we saw we know how
the model architecture here they were
describing how
it's
they're choosing the dimensions of these
things based on what fits inside the
gpus and then not only that but then
also the specific ordering of these
kernels is also chosen because
they want specific
operations specific ops to be close
together so that they can fuse them
right so
the hardware is driving the model
architecture development
with high drop rates this allows the
drastic improvement in compute
efficiency and memory
this implementation consists of
randomly shuffling the B samples over the
batch dimension and slicing the first
(1 - d) x B samples for the computations in
the block
so basically if you have a very high
drop rate
then you can save a lot on compute
fully sharded data parallel
so data parallelism is a form of
distributed training right you have
model parallelism and you have data
parallelism one kind of way to think
about it is that in model parallelism
you have your model split across
multiple devices in data parallelism you
have your batches or your data split
across multiple devices in in practice
usually there's a combination of both
there's both data parallelism and model
parallelism
minimizing our objective with the AdamW
optimizer requires four model replicas
in float32 precision
so there's four versions of the model
in float32
and this is interesting here so
I would have thought that they would
have done this in mixed Precision so
right when you're training models
every single parameter in that model is
taking 32
bits of storage right so a float32
takes up 32 bits something like a float16
takes half as much memory because it
only takes 16 and then something like a
uint8 takes eight and then something
like a 4-bit type takes four right so there's like
basically you can keep halving the memory
by reducing the precision
so mixed uh or mixed Precision training
is something that's popular
and I'm curious as to whether they did
that but let's see
uh so this sums up to 16 gigabytes of
memory for a billion parameter model
okay that's kind of cool
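Rough arithmetic for where that 16 GB comes from, assuming the four float32 replicas are the weights, the gradients, and the two AdamW moment buffers:

$$10^{9}\ \text{params} \times 4\ \text{bytes} \times 4\ \text{replicas} = 16\ \text{GB}$$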
so I could actually fit that on my 3090
in order to reduce this memory footprint
per GPU we split the model replicas
across GPUs
sharding 16 gigabytes across gpus using
the pytorch implementation of
FSDP which is just fully sharded data
parallel consequently the model size is
not bounded by the memory of a single
GPU but by the total sum of the GPU
memory across compute nodes yeah
so this is more model parallelism
the pytorch implementation of fsdp
brings a second Advantage which is to
save on Cross GPU communication costs so
this is another kind of theme where more
and more uh the GPU is no longer the
limiting factor in these training
problems right usually it's not the fact
that your GPU can't Matrix multiply fast
enough which used to be the case
nowadays the uh limiting reagent is
actually that the GPU is sitting there
idle waiting for information to be sent
to a different GPU or come back from
the
host machine right so
the communication between these gpus is
actually now the limiting factor which
is why you're seeing uh the rise of kind
of these
Advanced kind of like Hardware
interconnects that like uh I think the
best example of this is the Tesla Dojo
chip right
where
they basically put these like right next
to each other in such a way like this
yeah so that the communication between
these is a lot faster
right
because if you look at like a server
rack right now the the data has to go
through a pcie slot into the memory and
then back into
another thing right so like the
communication is starting to become part
of it so
at this point people are starting to do
more crazy things like these uh compute
planes compute mesh right where the gpus
are like right next to each other so
that they can very quickly communicate
and you're no longer limited by that
uh the weight shards are stored in float32
precision as required but broadcasting
weights and reducing gradients is done
in float16 precision okay so
they are doing some kind of mixed
Precision stuff here right
MLP head gradients are reduced in float
32 to avoid training instabilities this leads to
approximately a 50% reduction in
communication costs compared to the
float32 gradient all-reduce operations
used in distributed data parallel
which is used in other self-supervised
pre-training methods
as a consequence the training procedure
scales more efficiently than DDP
with float16 autocast when scaling the
number of GPU nodes
overall PyTorch FSDP mixed precision is
superior
to DDP with autocast
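A rough sketch of what that looks like with the PyTorch FSDP API; a real run needs a multi-GPU process group and a proper wrapping policy, so treat this as illustrative:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Communicate in float16 (weight broadcast, gradient reduction) while the
# sharded master weights stay in float32, roughly what the paper describes.
fp16_policy = MixedPrecision(
    param_dtype=torch.float16,    # dtype used for compute / broadcast weights
    reduce_dtype=torch.float16,   # dtype used for gradient all-reduce
    buffer_dtype=torch.float16,
)

# After torch.distributed.init_process_group(...) and building the model:
# model = FSDP(model, mixed_precision=fp16_policy)
```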
very cool and you know what's even
cooler is that all of this is available
right
I'm telling you like low-key I feel like
uh meta's
meta is the open AI company meta is much
more open they release their tools they
talk about their tools they talk about
the techniques
like that's what I want to see you know
I want to see that I like this building
out in the open
much more commendable than uh
opening eyes building in secret
most of our technical improvements to
the training loop aim at improving the
training of large models over large
quantities of data for smaller models we
distill them from our largest model
instead of training them from scratch
yeah this is huge like it seems like
such an easy thing to intuit
like hey rather than training four
different size models from scratch why
don't we just train one really huge
model and then just distill the smaller
ones from the bigger one
that does seem like
I'm like wow why didn't people think of
that before
since our objective function is a form
of distillation from the teacher Network
to the student Network we leverage the
same training loop with a few exceptions
we use a larger model as a frozen
teacher
keep a spare EMA of the student that we
use as our final model so this is the
exponential moving average of the
student they probably have multiple
students that are being trained in
parallel and then they basically average
them together to have the uh
student that they end up publishing
look at that nice little ablation study
look at that so they tell you here are
all the different uh
techniques that they described for
training these big models and then here
is all the improvements that you get
so
layer scale stochastic depth teacher
momentum tweak warm-up schedules batch
size what's the biggest one here is this
there you go dude that's or actually I
guess this
or reproduction I don't know what that
means
making the batch size big is so
important
that's something that like I feel like I
learned time and time again it
just seems to be the most important part
it's like the bigger your batch size the
more stable your training is and the
better the final solution that you get
to
which is unfortunate because like as an
independent researcher as someone who
kind of only has uh like consumer gpus
you can't train on these giant batch
sizes you need these kind of distributed
systems distributed kind of like
multiple nodes with like hundreds of
gpus in order to have these massive
batch sizes
performance as in our experiments the
linear probe performance is lower
bounded by the k-NN performance
some modifications like layer scale and
high stochastic depth rate 0.4
incur a decrease in linear probe performance but have
the benefit of increasing the stability
by avoiding NaN loss values
these modifications allowed for the next
set of improvements to be added
we present a set of ablations to
empirically validate different
components of our pipeline the technical
modifications the pre-training data and
the impact of model distillation
we consider various Downstream tasks
that are described in section 7.
okay so this is kind of a description of
all the different ablation studies so
basically that table but they're going
to go through all the different parts
here
our approach improves over the iBOT method
by combining it with several existing
components described in section four to
evaluate their importance we train
multiple models where we successively
add components to the Baseline iBot
model
so we report the top one accuracy so top
one accuracy is the hardest accuracy so
top five is basically as long as
your model as long as the answer to the
actual classification problem is within
the top five responses with the highest
confidence then you count that as a
success top one accuracy is much more
stringent it means that you have to the
highest confidence class has to be the
right answer
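A quick sketch of the difference in code:

```python
import torch

def top_k_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose true class is among the k highest-confidence predictions."""
    top_k = logits.topk(k, dim=-1).indices            # (batch, k) predicted class ids
    hits = (top_k == labels.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# top-1 is stricter than top-5: the single most confident class must be correct.
logits = torch.randn(8, 1000)        # e.g. ImageNet-1k logits for a batch of 8
labels = torch.randint(0, 1000, (8,))
print(top_k_accuracy(logits, labels, k=1), top_k_accuracy(logits, labels, k=5))
```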
generally we observe that each component
improves the performance on either K N
or linear probing only layer scale and
stochastic depth blah blah okay
quality of the features is directly
related to the quality of the
pre-training data that's a
obvious statements 101 right there
we randomly sample 142 million images
from the same data source
we train a ViT-g/14 on each data set for
the same number of iterations and
include a variant of imagenet 22k
the most Salient observation is that
training on a curated set of images
works better on most benchmarks than
training on uncurated data
I mean are they keeping the number of
images the same or are they
yeah so this is the problem is that
they're comparing a 142 million curated
image data set to 142 million uncurated
images so if the size of the data set is
the same of course the curated data is
going to be better right but if you were
to say 142 million curated images
compared to 300 million uncurated images
now I don't know if you would get the
same result it might be the case that
the uncurated images because they're
just bigger would be better
what are your thoughts on cloud gpus
Cloud gpus are
very useful I guess it's like kind of
the way to go like if I was at a startup
I wouldn't buy gpus and train things
locally I would use the cloud the
problem is that cloud gpus can get very
expensive very quickly so unless you
have a bunch of VC money to to burn the
cloud gpus are generally you know I'm
saying prohibitively expensive for
individual people
like if you want to mess around on
your old GPU at home you know that's not
that expensive but if you want to mess
around on like a100s
pretty soon you're going to end up with
a couple hundred dollars of AWS bills
you know so cloud gpus is pretty much
the way to go it's just expensive so you
have to be a startup or a company or
maybe an academic institution that has
kind of a budget
when compared with models trained on
ImageNet-22k training is also superior on
all the benchmarks
okay what do we've got here ablation of
Open Source training data we compare the
ImageNet-22k that was used
okay so here you have different training
data sets you have their 142 million
curated you have the 142 million
uncurated
and you can actually see here the
difference it's only it's only a slight
difference or actually it's a big
difference here look at that
59 compared to 73 for the uncurated
versus curated
and imagenet 22k here
it's actually very similar look at that
so I mean what I'm learning from this is
that the ImageNet-22k data set is
roughly equivalent to the LVD-142M so
how many images are in ImageNet-22k
let me look up the
ImageNet-22k data set
size
how many images does this have
I think I actually see it here so you
see imagenet 22k has 14 million images
an lvd 142 mil has 142 million images so
this is kind of interesting here that
this data set which has 10 times less
data right this is 14 million images is
getting slightly better performance on
ImageNet-1k than the LVD-142M
I mean obviously it's a more specific
data set so it makes sense that the
imagenet data set is going to be like
kind of better for imagenet but
still crazy that 10 times more data
doesn't give you a huge performance
boost
model size and data
we quantify the importance of scaling
data with model size as the model size
grow training becomes more beneficial
than training on imagenet 22k
yeah so the bigger models if you train a
big giant model on a small data set it's
just going to overfit hard so you can
only train these big models if you have
big data sets
the two go together
a ViT-g trained on LVD-142M matches the
performance on imagenet 1K
we validated the proposed technical
improvements by adding them
incrementally
this section analyzes the performance
hit observed if we ablate specific loss
terms starting from our best performing
model
we ablate the importance of the KoLeo
loss and the impact of the masked image
modeling term
uh ADE20k so here are a couple different
benchmarks that they're going to use to
compare Table 3a shows the impact
of using the KoLeo loss
model scale versus data scale so here on
the x-axis you have the uh sizes of the
vision Transformers that they're used
right so L
then huge and then g is the biggest one
so these are the smaller and the bigger
and then on the X or Y axis here I guess
this is probably the performance the top
one performance or something like that
so you can see here that
uh the bigger models right are able to
more effectively use the big data set so
this is because ImageNet-22k is a
smaller data set it's like 10 times smaller than
LVD-142M so you can see that when
you have a smaller model right a ViT-L
which is still a pretty big model
the smaller model doesn't use the big
data set as effectively as the bigger
model
if you give the big model and the big
data set you get better performance but
small model big data set doesn't do as
well
which is kind of what they're showing
you here
uh for small architectures we distill
larger models instead of training them
from scratch
we use the distillation procedure
described in section five
uh we evaluate the effectiveness of this
approach by comparing a ViT-L/14 trained
from scratch with one distilled from a
ViT-g/14
okay this is pretty cool so
all right so we were talking about
distillation as a method of basically
having smaller versions of the model
right you train one giant model
from scratch and then you distill the
smaller models
but is that going to be the same
performance as a small model trained
from scratch
and that's what they're showing you here
is
the ViT-L trained from scratch is this
blue and this is the score that it gets
on all these different categories I kind
of like this this weird table here right
so each of these is a benchmark so cars
food imagenet Kitty which is a kind of a
autonomous vehicle data set and so on
let's zoom in here
so you can see that actually when you
train it from scratch it's worse on
everything like the distilled model is
actually better on everything and and
here's the even crazier thing the
distilled model is better than the
teacher model
right like what, that's weird to
think about you take a big model you
train it from scratch you use the big
model to train a smaller model right
the smaller model is just distilled from the
bigger model and it turns out that the
smaller model trained on the bigger
model
is better at Oxford-H and Paris-H
I think that's kind of weird
right not intuitive
we show that a ViT-L model distilled from
a frozen ViT-g outperforms the same model
and sometimes even outperforms the
distillation Target
we measure the impact of changing the
resolution during pre-training on the
performance image of image and Patch
level features
we consider models trained from scratch
using a fixed resolution of 224
or 416 by 416. so these are the two
different sizes that they train at
uh we resume for 10K more iterations at
416 so they're doing this kind of like
curriculum like alternating training on
uh larger images and smaller images
we report the performance of a linear
probe evaluated at various resolutions
the model train on high resolution
images performs the best across
resolutions but it comes at a high cost
training at 416 pixels by 416 pixels
takes three times more compute than
training at 224.
so there's this kind of trade-off of
like ideally we train
on
bitter lesson seems to kill low budget
academic AI simply scale up everything
yeah that's true
uh how does distilling differ from
transfer learning so transfer learning
is taking a model that has already been
trained and then using it for a new task
so in transfer learning you're still
using the original model but when you're
distilling you're you're training a
separate smaller model right so
distillation is taking a big model and
then using it to train a small model
transfer learning is using a model and
then training it on a new task
this is
training on high resolution for only 10K
iterations at the end of the training is
almost as good and only requiring a
fraction of the compute
in this section we present the empirical
evaluation of our models on many image
understanding tasks
we evaluate both Global and local image
representations on category and instance
level recognition
so these are a couple different uh
computer vision tasks right you have
instance level recognition semantic
segmentation monocular depth estimation
or prediction and then action
recognition so these are a lot of different types of tasks right action recognition is kind of like classification and a lot of times this is like a pose detection task right so you're kind of detecting key points on a human or a hand or something like that
monocular depth estimation is taking a single camera image and then giving you the depth image from that
so it's like a pixel level task because
you have to identify the depth at every
single Pixel
instance level recognition that's more
of like a bounding box task so it's not
pixel level it's just giving me the
bounding box of a specific uh instance
right or like
object within the image and then
semantic segmentation is also pixel
level because it's like you have to tell
me what the class of every single Pixel
in that image is so you have
a big smear of different tasks here you have pixel-level tasks
and you have more high-level things such as detection and action recognition
so we train linear classifiers on top of
the Frozen features linear classifiers
is just a fancy way of saying like a
little tiny model head
right so if you had ImageNet-1K the linear classifier is basically you're taking that giant pre-trained feature encoder and they freeze it so they don't let any gradients get into it
and then you just put a little tiny head on top and that little tiny head has 1000 outputs and each of those 1000 outputs represents one of the ImageNet classes
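as a concrete sketch of what that looks like, assuming the torch.hub entry point published in the DINOv2 repo and a 1024-dimensional ViT-L/14 feature (both worth double checking against the repo, this is illustrative only):

import torch
import torch.nn as nn

# linear probe sketch: frozen backbone plus a 1000-way linear head for ImageNet-1K
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                    # no gradients flow into the encoder

head = nn.Linear(1024, 1000)                   # ViT-L/14 features assumed to be 1024-d

def probe(images):
    with torch.no_grad():
        feats = backbone(images)               # frozen features
    return head(feats)                         # only this head gets trained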
for a short duration and gets results close to the full training okay so what are we looking at here we're looking at the image resolution so this is 224 by 224 and then 768 by 768 so these are low resolution images and high resolution images on the x-axis
and then on the y-axis you have a mean IoU intersection over union which is basically how much your predicted region overlaps the ground truth it's used for bounding boxes and for segmentation masks
so here's a low IoU and here's a high IoU right
so like how much does the region that you predicted overlap the true ground truth region
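if you want the exact definition, IoU is just intersection area divided by union area, and the same ratio works whether you compute it over boxes or over segmentation masks. a tiny sketch with boxes given as (x1, y1, x2, y2):

def iou(box_a, box_b):
    # intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

iou((0, 0, 10, 10), (5, 5, 15, 15))            # 25 / 175, roughly 0.14, a low overlap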
right so higher is better and then
higher is better as well on accuracy so
accuracy is this is a classification
task so accuracy is basically did you
get it correct
and you can see here that the low
resolution
goes down so if you're training your
model at a low resolution it does not
perform well at high resolution
if you train your model at high
resolution it does perform well at high
resolution but then this blue line here
is this kind of curriculum technique
that they just described where they
train in a low resolution and then they
train at a high resolution so they like
kind of do this two-part curriculum and
they show you that okay well actually
that works pretty much
as good as the high resolution
right it's still better to train at the
high resolution if you if you wanted to
but it's going to be so much more
compute heavy that you're actually
better off doing this kind of like
curriculum technique where they train it
at a low resolution and then a high
resolution and it performs
quite about the same
second we show that they match or surpass the performance of weakly supervised ones on a substantial number of tasks
in our comparisons we use two kinds of models as baselines we compare against the best performing self-supervised models that are openly available
we run our evaluations for MAE DINO SEER MSN EsViT and iBot these are just basically a bunch of methods that they're comparing to
several architectural variants were proposed we report results for the one that leads to the best top-1 accuracy
we report performance of open source weakly supervised models such as CLIP
okay so they're of course going to
compare to clip
because clip is like extremely popular
okay imagenet classification
as a first evaluation we probe the
quality of the holistic image
representation
produced by the model
so what does that mean what is the
quality of an image feature right
this is
this is a fundamental problem in machine
learning in general right the quality of
your embeddings the quality of your
features
and
right now kind of the gold standard is
basically to have a nice varied set of
benchmarks right and this is not just a
problem in computer vision it's a
problem in uh natural language as well
or any kind of image modality right
how do you determine the quality of the
features of a giant llm right
well you have a variety of benchmarks
and then you evaluate its performance on
all those benchmarks
so that's kind of the same thing that
you're going to do here for the uh
computer vision model is like okay well
are these features good are they better
than that feature or that feature right
it's like
it's impossible to know as a human because what even is a 1000-dimensional vector
you basically can't understand what that even means so how can you judge the quality of it so
right now the way that people do it is
they basically just create these like
benchmarks and they have like as big of
a set of benchmarks as they can and then
the model that performs the best on all
the benchmarks has the best features but
I suspect that over time we will start
to learn more and more about like what
those features actually mean maybe
better techniques for understanding
maybe clustering the features like
I suspect that will develop kind of a
whole science around feature
understanding and kind of like
that will become the new way to
determine feature quality rather than
what we're doing now which is basically
just uh using benchmarks in order to
kind of as a proxy for the feature
quality
because most SSL methods use ImageNet validation performance we also report top-1 accuracy on ImageNet
we run the evaluation with our code and we compare our frozen features to the best publicly available SSL features regardless of architecture or pre-training data
we also see that the performance
increase on alternative test sets is
larger for our method indicating
stronger generalization so
again this we don't really know how to
measure generalization well
other than to just basically evaluate
the model on a variety of different
tests and then see if it performs well
across all of them
we also want to validate that our features are competitive with state-of-the-art open source weakly supervised models
so open clip
Eva clip
let's see let's see how you perform this
is a magic number right here so we got
clip
with a ViT-L
and then these are the different benchmarks here and you get 79
you have EVA-CLIP with a ViT-g so this is a bigger CLIP
83
DINOv2
with the ViT-g the bigger one 83
okay so it's it's not better but it's
competitive I see what they're saying
how does it compare to dino vits I mean
this is a smaller one so it's not a fair
comparison
78.
right a small ViT with patch size 8 you get 78
a small ViT with patch size 14 you get 79 so this is maybe a little bit unsettling it tells you that DINOv2 is not necessarily that much better than DINO v1
really it's just bigger
right
and this here this 14 336 I believe the 336 refers to the input resolution so it's the same ViT-L/14 just run on 336 by 336 images instead of 224 so the plain ViT-L/14 is a slightly smaller setup than the ViT-L/14 at 336
can we fine-tune the encoders
we question whether the ability of our models to produce high quality frozen features impacts their performance when fine-tuned with supervision on a specific dataset
yeah this is important
because I myself tried to do this right
when I
I was messing with the segment anything
model and I was using the pre-trained
feature encoder that they had
and I had
I was trying two different things I was
like okay well if I freeze this
pre-trained feature encoder and then try
to do this segmentation task is it
better or is it actually worse than if I
don't freeze it and let some of the
gradients flow through it
and
intuitively if you if you have been
doing this generally the advice up until
now is that yes letting some gradients
go through it is better than just
freezing it right
but
while this is not core to this work this experiment is indicative of whether we have involuntarily specialized blah blah we apply the fine-tuning pipeline without tweaking hyperparameters we show that the top-1 accuracy on the validation set improves by more than two percent
here you go
but that's when the backbone is fine-tuned so
it seems like with fine-tuning you can still get a tiny bit of extra performance right
and I don't know I feel like this is
going to go away right
I feel like over time these uh feature
encoders right these pre-trained uh
models like this right these pre-trained
backbones I think we're going to get to a point where you actually don't want to fine-tune them because they're already so good they're already so specific
and so general right they can kind of work on everything and the features that they use are kind of fragile because they're giant and the learning rates that they use are very small and they're in a local minimum that is so deep
that if you try to fine-tune them you basically just mess them up so
it's interesting to see that fine-tuning
these uh
feature encoders is still you're still
going to get a little bit of performance
for your specific task but I do feel
like this is kind of eventually going to
go away
fine tuning is optional
yeah that's that's the future is fine
tuning is optional
to complement our study and probe the
generalization of our features we
evaluate our imagenet 1K models trained
with linear classification heads
on domain generalization benchmarks
okay so these are benchmarks specifically chosen to be kind of very wide with lots of different weird looking images in order to determine whether your model is overfit to ImageNet or not
uh they keep using SSL here that just means self-supervised learning they just shortened it to SSL
supervised fine-tuning on imagenet 1K
is it ready to be used I didn't see any
docs on how to use the model to input an
image and get the retrieval against a
given set of images so
yeah you can do this so Gustavo if you took this model right here
and then created a python script that fed every single one of your images through this pre-trained encoder
you're going to get a vector for each image and then you store all of those vectors in a vector database such as this one here faiss right
then you can take any new image encode it and then find all the images that are similar to it so you can implement retrieval pretty easily if you want
right
the key is that you can download this model right here you don't even need to download it manually you can just load it with python
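a minimal sketch of that retrieval pipeline, assuming the torch.hub entry point from the DINOv2 repo, faiss installed, and random tensors standing in for your own preprocessed images:

import faiss
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

def embed(image_batch):                        # image_batch: (N, 3, 224, 224) tensor
    with torch.no_grad():
        feats = model(image_batch)
    feats = torch.nn.functional.normalize(feats, dim=-1)   # unit norm for cosine search
    return feats.cpu().numpy().astype("float32")

gallery_images = torch.randn(16, 3, 224, 224)  # stand-in for your image collection
query_images = torch.randn(2, 3, 224, 224)     # stand-in for new query images

index = faiss.IndexFlatIP(1024)                # inner product on unit vectors == cosine
index.add(embed(gallery_images))
scores, ids = index.search(embed(query_images), 5)   # top-5 most similar gallery images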
so if you combine this with this
you can do retrieval based on similarity
we could actually probably do that that
actually probably seems like a stream
that I could do if you guys want to do
that
if you guys are interested in that uh
join the Discord and then uh
comment on that and we could totally do
that as a stream
okay let's get back to it
uh ViT-g
supervised fine-tuning
so here they're showing you fine-tuning at slightly different resolutions
and you can see that the slightly larger resolution image benefits a little bit more from the fine-tuning
incorporating prompting is eventually
better than fine tuning
yeah uh
Fee nugen Van I'm sorry if I didn't
pronounce your name right but I actually
think that that's what I envisioned I
think right here this paper is obviously
they trained it with no text this is a
pure image Foundation model right it's
not like clip clip is a image and text
Foundation model and part of the reason
they did that is because the data sets
for image and text are not as good right
but I think eventually what's going to
happen is that you're going to basically
use a kind of like Auto labeling
technique right you're going to take
images you're going to use clip to
create a
uh uh caption for that image and then
you're gonna train a text and image
model on the captioned images so I think
over time the quality and the
availability of image Text data is
actually going to improve because we
have models such as clip that can
understand it so I see this kind of giant self-supervised pipeline
where you're captioning images and then using that to train a model and then using the better model to caption more images and so on and you have this kind of flywheel that keeps labeling and keeps captioning
and then over time it actually gets
better because
I do think that intuitively it seems
like
having additional modalities right
having both image and text will result
in a better uh feature space than just
images by themselves but
we're at the point now where scale is
still King and if you can have a bigger
data set by just using only images the
features that come out of that are going
to be better
how powerful this will be to be used as
an image labeling tool
is this I mean this is what you want here so Gustavo this table here table four
you see here
CLIP is about 79 EVA-CLIP is about 83 DINOv2 79 so it's not going to be significantly better than what you have access to already but it's kind of on par
right I think if you use Dyno V2 if you
use this mod this encoder and then just
basically
fine tune or not fine-tune but like use
it for your own uh task you're probably
going to get about the same uh
performance as you were if you were to
use Eva clip or any of these other kind
of large
foundational Vision models
so it's not a step function it's not
like we uh
we suddenly unlocked a new capability it
just seems like
uh competitive with the current models
domain generalization with a linear probe see this is more interesting so frozen features so what they did here is they freeze the features they say okay I'm not going to fine-tune this I'm going to freeze the encoder and then I'm going to see how well it performs on these benchmarks here ImageNet-A ImageNet-R and so on
and OpenCLIP is actually very fragile you see that OpenCLIP if you freeze it and you don't push gradients into it it actually doesn't perform very well at all right it doesn't have the ability to generalize
but look at DINOv2
75 that's much better that means that the frozen features of DINOv2 are actually much more general than the frozen features of these other models here OpenCLIP DINO v1 MAE and so on
so I don't know I would still use this
if I was doing a computer vision problem
I feel like this is the feature encoder
I would use today
additional image and video
classification okay so they're using
this for video now
we studied the generalization of our
features on Downstream classification
benchmarks
we consider two sets of evaluations in that context on one hand we use large and fine-grained datasets such as iNaturalist and Places205 okay so
I think these are also classification tasks
you train with a linear classifier and data augmentations
our model significantly outperforms OpenCLIP
you know this is that table right there
we measure the performance of our model
on video action recognition
right so this is basically like YouTube
videos where it's like someone cooking
and like someone riding a bicycle and
things like that and it's basically a
classification task but you have a kind
of a sequence of frames right so you're
performing classification with a bunch
of images rather than a single image
we pick eight evenly spaced frames so
it's not very dense yet right it's not
like you're looking at a two hour video
you're looking at eight frames of a
video so
video is still primitive
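for reference, picking eight evenly spaced frames is usually just an index computation like this (a sketch, not their exact loader):

import numpy as np

def evenly_spaced_frames(num_frames_in_clip, num_samples=8):
    # indices spread uniformly from the first frame to the last
    return np.linspace(0, num_frames_in_clip - 1, num_samples).round().astype(int)

evenly_spaced_frames(300)                      # array([  0,  43,  85, 128, 171, 214, 256, 299])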
we see that amongst the self-supervised
approaches our model clearly sets a new
state of the art
okay so they're saying it clearly outperforms on SSv2 and SSv2 is a more complicated dataset I haven't really heard about this what is it Something-Something V2
so it seems like
it's like kind of
these are still pretty low resolution
but it's like egocentric
video of someone like grabbing things
putting something on a Surface moving
something up covering something with
something putting something into
something
that's kind of a cool data set I've
never heard of this
but okay it's like kind of like an
egocentric data set of like opening a
jar and putting something in it so that
is kind of interesting
ssv2 requires a much richer
understanding of the video frames yeah
it's you can't just like classify based
on texture right
Jesus Christ
uh we compare selected frozen features on 12 transfer classification benchmarks these benchmarks cover scenes objects textures blah blah blah our model outperforms other self-supervised learning models
okay so here you have image
classification video classification a
couple different versions here's the
ssv2 that we just looked at open clip
very bad performance here
actually no it's not that much worse than DINOv2 I guess SSv2 is just a very hard dataset that's a hard dataset look at that 35 percent is the state of the art on that
that's good you know you want benchmarks
where everyone performs poorly right you
want a benchmark that's very very
difficult like
these benchmarks here where like
everyone's scoring in the high 90s
those aren't good benchmarks because
basically once you get to like like 90
95 getting that last extra percent
is not a measure of how good your model
is it's like a measure of how overfit
your model is so this is why I think
that as our models get better and better
over time the benchmarks need to get
harder and harder over time so
I feel like imagenet 1K
is not a good data set anymore or is not
a good Benchmark anymore because it's
like every single score is high like for example this here C10 I think that means CIFAR-10
this is borderline meaningless like what does 98.7 versus 99.5 mean right it just means it got one more image correct basically or like a couple more images correct and
the reason it got those correct is
probably not necessarily a good reason
so these data sets here flowers like
look at these 99.99 like
I think we need to start getting rid of
some of these benchmarks that are too
easy now
right this one much better Benchmark
you're still at 80 63 87 right you can
still actually tell what's a better
model than the other but like
these data sets that are just way too
easy these benchmarks too easy
even though these benchmarks favor text
guided pre-training our features are
still competitive with open clip on most
classification benchmarks
instance recognition is a different problem now
on the task of instance-level recognition we use a non-parametric approach
images are ranked according to their cosine similarity with a query image we evaluated our model and compared to baselines on Paris and Oxford
which are landmark recognition benchmarks
we also evaluate on Met a dataset of artworks from the Metropolitan Museum
okay a couple different instance recognition benchmarks here
we probe the quality of patch level
features so
patch level features versus just
features right so normally when they say
feature is what they're referring to is
what comes out at the end of the
pre-trained uh encoder right so you have
your vision Transformer
you have your image your image gets fed into your vision Transformer and then out of that you get a feature right
and that's the feature vector that they're normally referring to when they say features
when they say patch-level features what they're referring to is the features for the little individual patches of the vision Transformer right the vision Transformer cuts the image up into a grid and each little grid cell gets fed in as its own token
so you can look at the features that come out at the end of the vision Transformer or you can look at the features for each individual little patch that the vision Transformer is using right
so that's what they mean here by probing the quality of the patch-level features as opposed to just the features at the very top
uh instance level recognition so this is
a different Benchmark right so up here
we were looking at uh image and video
classification this is now instance
level recognition
and we can see how
again you're seeing
Dyno V2 kind of beating out
all the other models
and then semantic segmentation
a different kind of task again this is
semantic segmentation is I mean you guys
probably know what segmentation is
but just in case you don't
uh which is this right it's like
basically individual labeling for each
individual pixel
hey can you suggest which models will work well on the Kvasir dataset
uh for classification purposes what is the Kvasir dataset
Kvasir dataset
multi-class image data set for computer
so it's like
this
oh he's a kind of gross dude oh
lifted polyps
oh my god dude this is nasty
ulcerative colitis
okay
um first of all God bless you for uh
performing medical segmentation like
some of some of those medical image data
set tasks are like nasty like
the
the ones for like skin diseases like
like you know what I'm saying like I've
seen some on those like skin
disease data sets
uh which models will work well so
it seems like it's kind of a classification dataset so here you go man I mean for classification DINOv2 ViT-g/14
there you go so right here this is the one you want
ViT-g/14 import torch torch.hub.load
uh take this model
uh take your dataset whatever this was here
uh encode every single one of these
images with this pre-trained Frozen
encoder
and then train a linear classifier on
top of that
that's a good start I'm not telling you
that that's going to give you the best
possible performance there's probably
all kinds of extra tricks that you can
use to make a better performance for
this thing but that's probably a good
start is taking this pre-trained feature
encoder encoding all your images and
then training a classifier on top of
that
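something like this sketch, with the hub name taken from the DINOv2 repo and scikit-learn for the linear classifier; the random tensors and the 8-class label count are placeholders for your actual Kvasir images and labels:

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14").eval()

train_images = torch.randn(32, 3, 224, 224)      # stand-in for preprocessed images
train_labels = np.random.randint(0, 8, size=32)  # stand-in labels, e.g. 8 classes

with torch.no_grad():
    features = encoder(train_images).cpu().numpy()   # frozen features, no fine-tuning

clf = LogisticRegression(max_iter=1000).fit(features, train_labels)
predictions = clf.predict(features)              # use held-out features here in practice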
semantic segmentation for our semantic segmentation evaluation we consider two different setups
linear a linear layer is trained to predict class logits from the patch tokens
this is used to produce a low resolution logit map
okay so logits are basically the output of a classification head before it is put into a softmax and turned into a confidence
and generally your cross entropy is actually computed with the logits because of the way the kernels work out it's numerically more stable and a bit faster to do it that way
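a quick illustration: PyTorch's cross_entropy takes the raw logits directly and does the log-softmax and negative log-likelihood together, which is the numerically safer way to do it

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0, 0.5]])      # raw head outputs, before any softmax
target = torch.tensor([0])                     # index of the correct class

loss = F.cross_entropy(logits, target)         # log-softmax + NLL computed together
probs = logits.softmax(dim=-1)                 # only needed if you want confidences to report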
this procedure is simple but cannot
easily produce high resolution
segmentations
we report the performance of our model variants as well as the baselines on three datasets
interestingly our evaluation is on par with fully fine-tuning an UPerNet decoder
so this is kind of interesting they're
doing like a little bit
so like
these pre-trained encoders right they
give you a feature vector and the
feature Vector is very easy to use for a
classification task because all you got
to do is just put like a little
classifier head on top of that but once you have a more complicated task right like a segmentation task then maybe you want a little bit more complicated of a head and that's kind of what they're talking about here different ideas right like a decoder or a logit map right like making a little 32 by 32 logit map so
there's still some technique still some artistry that you can use to design what you do with those features basically how are you going to take those features and use them for your task
all right so here we go these are the benchmarks here KITTI is an autonomous vehicle dataset
and I guess here lower is better so
we see that it's outperforming CLIP but not by much
depth estimation this is another uh
kind of dense task similar to
segmentation where you have to predict
every single Pixel
we consider three different setups we extract the last layer of the frozen Transformer and concatenate the class token to each patch token
we then bilinearly upsample the tokens to increase the resolution and finally we train a simple linear layer using a classification loss by dividing the depth prediction range into 256 uniformly distributed bins so
when you're doing depth right the depth
image is going to basically be the same
exact resolution as the colored image
that you input except every single Pixel
is going to be a number that represents
how far away that pixel is from the
camera right
and because the depth image is usually a
uint 8 Single Channel grayscale image
right in a uint 8 you have 256 possible
values right like 0 to 255. so
they turn it into a classification task
of like hey rather than trying to
regress the exact depth number as a
float instead of that turn it into a
classification task and for each pixel
I'm trying to classify each pixel into
one of 256 different classes so it turns
a depth estimation task into a semantic
segmentation task with 256 classes
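here is roughly what that quantization looks like as a sketch; the 0.1 to 10 metre range is my assumption, the paper just says the depth range is divided into 256 uniformly distributed bins

import torch

NUM_BINS, D_MIN, D_MAX = 256, 0.1, 10.0        # depth range in metres is an assumption

def depth_to_class(depth_map):                 # depth_map: (H, W) float tensor in metres
    t = (depth_map.clamp(D_MIN, D_MAX) - D_MIN) / (D_MAX - D_MIN)
    return (t * (NUM_BINS - 1)).round().long() # one of 256 class labels per pixel

def class_to_depth(class_map):                 # map a predicted bin back to a depth value
    return class_map.float() / (NUM_BINS - 1) * (D_MAX - D_MIN) + D_MIN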
they also do a setup where they concatenate the tokens from several intermediate layers three six and so on so this is kind of almost like a U-Net situation right in a U-Net you are not just taking the output of the encoder and then using that to do your decoding
but you're also taking intermediate results from the encoder and feeding those in kind of like a skip connection or residual connection kind of idea so they're doing the same thing here they're taking the outputs from these intermediate layers and then also using those in the decoder
uh interesting to see that iBot features outperform the ones from OpenAI CLIP
our model with the DPT matches or
exceeds the performance of recent work
okay qualitative results we show some qualitative results from our dense prediction evaluations
the linear segmentation produces good results and behaves much better under this evaluation setup
the qualitative results on depth
estimation clearly illustrate the
quantitative gap between openai clip and
Dyno V2
much smoother depth estimation
all right let's actually see these
pictures
okay so we got CLIP
so this is the input image this is CLIP and this is DINOv2
uh
and you can see here
how
clip has all this weird like artifacts
here
but Dino V2 seems to be a lot cleaner
this is the image this is uh Dyno V2 and
then this is clip
the dyno V2 seems better this is the
depth estimation so again things that
are purple are very are far away from
the camera and then things that are kind
of this like light yellow orange color
are supposed to be closer to the camera
so you can see here how the model trained using the frozen CLIP feature encoder has this kind of noise the snow as it's sometimes called
and DINOv2 is much much cleaner much smoother here
yeah kind of significantly better
out of distribution examples so
out of distribution just means that
there's some distribution of images
within your data set and there's some
distribution of images that your model
has been trained on out of distribution
means there's it's an image that's
outside of that distribution it's an
image that's weird it's like unusual in
some weird way so what does that mean
that means like a drawing right so a
drawing of a room this almost looks like
a van Gogh drawing but like a drawing of
a room is
out of distribution
for a data set where everything you
trained on was real pictures of rooms
and what they're showing you here is
like look at how their model the the
model that they trained with the dyno V2
Frozen feature encoder is actually still
able to do monocular depth estimation
and semantic segmentation on this out of
distribution example relatively well and
here we have the same kind of thing it's
like a painting right this is like oil
painting completely different kind of
textures and and
and patterns here than an image but you
can still get a pretty good depth image
and this is a little bit more not as
good but still pretty good
this is a very complicated image you
have like people laying on top of other
people
and still gets it
uh we show a few examples of applying
the depth prediction and segmentation
linear classifier to out of distribution
examples in figure eight
the qualitative results support our
claim that our features transfer between
domains
the quality of the depth estimation and segmentation predictions for pictures of animals or paintings is very good even though the domains are very different
PCA of patch features okay so now they're going to be doing principal component analysis not on the final features but on the features at the patch level right so deeper in the vision Transformer looking at what's actually happening at the individual patches
uh we show the results of the principal component analysis performed on the patch features extracted
we keep only the patches with a positive value after we threshold the first component
this procedure turns out to separate the image's main object from the background
Okay so
basically you're getting
this is like emergent right it's
emergent that the model learns to
separate the background in the
foreground right
I'm not sure that those data sets are
actually out of distribution because
that sketch is also present in high
frequency of natural images
should be a better test on image
modality like ultrasound yeah I agree I
think I agree with you that
you can call these out of distribution
but like there's degrees of out of
distribution right like this this
painting of people is probably right
here if you have your your data
distribution is probably right here
versus like the X-ray image
like imagine if someone took an x-ray of
a rock underwater right like the X-ray
of the rock underwater is going to be
like here it's like way out of
distribution so I agree with you that a
better example of out of distribution
images would be like x-rays or like
sonar or like some kind of weird image
modality that doesn't look anything like
a natural image versus like these oil
paintings uh sketches are not as out of
distribution as they could be so I agree
with you there
foreign
okay back to this we compute a second PCA on the remaining patches across three images depicting the same category
we color the three first components with three different colors and present the results
yeah so these are the images we saw at the beginning again kind of super interesting that there's this emergent behavior of separating the foreground and the background and of course it kind of makes sense that it would do that
but more visualization we compute the
PCA between the patches each component
corresponds to a different color channel
so three components right PCA
you can separate your data into any
number of Dimensions but usually it's
separated into three components for
visualization purposes right and here
they're separating into r g and B so R
represents
the first component G probably the
second component blue probably the third
component
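roughly, the visualization is doing something like this sketch; the random tensor stands in for real patch features and torch.pca_lowrank is just one convenient way to get the top components

import torch

patch_feats = torch.randn(256, 1024)           # stand-in for (num_patches, feature_dim)

centered = patch_feats - patch_feats.mean(dim=0)
U, S, V = torch.pca_lowrank(centered, q=3)     # top-3 principal directions
components = centered @ V[:, :3]               # (num_patches, 3) projection scores

# rescale each component to [0, 1] so the three columns can be shown as R, G and B
rgb = (components - components.min(dim=0).values) / (
    components.max(dim=0).values - components.min(dim=0).values
)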
and what they're showing you here is
that the components for different images
end up having
kind of similar semantic meanings how
like you see how the blue tends to
represent the legs of these animals
right the green tends to represent the
head of the animal right
so not only does it have this emergent ability to separate foreground and background but it also has this emergent ability to separate the head of something from the legs of something and the body of something so it's like it's
kind of learning the underlying kind of
patterns of our reality
emergently which is pretty cool
the first component corresponds to a specific color
delineating the boundary of the main object and the other components correspond to parts of objects
and match as well for images of the same category
this is an emergent property our model was not trained to parse parts of objects right
Foundation models
have emergent intelligence
we explore what type of information our
patch level features contained by
matching them across images
we start by detecting the foreground
using the procedure described above then
we compute the euclidean distance
between patch features
so I wonder why they're doing this right
why not cosine similarity between patch
features
why euclidean distance
then we apply a non-maximum suppression
to keep only the Salient ones
we observe that the features seem to
capture information about semantic
regions that serve similar purposes for
instance the wing of a plane matches the
wing of a bird
we also observe that the model is robust
to style and large variation of poses
yeah this is the most impressive to me like this overhead shot it's not a satellite image it's more like a drone shot but like
this is crazy
the fact that it can recognize that all of these are horses and that all these horses have heads and that all the heads match the head of this horse here
like what that's crazy
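here is a rough sketch of that matching step between two images using euclidean distance as the paper describes; keeping only the closest pairs stands in for their non-maximum suppression, and the random tensors stand in for real foreground patch features

import torch

feats_a = torch.randn(200, 1024)               # foreground patch features of image A (stand-in)
feats_b = torch.randn(220, 1024)               # foreground patch features of image B (stand-in)

dists = torch.cdist(feats_a, feats_b)          # (200, 220) pairwise euclidean distances
nearest_in_b = dists.argmin(dim=1)             # best matching patch in B for each patch in A
match_dist = dists.gather(1, nearest_in_b[:, None]).squeeze(1)

keep = match_dist.argsort()[:50]               # keep the 50 strongest (closest) matches
pairs = list(zip(keep.tolist(), nearest_in_b[keep].tolist()))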
all right and you wouldn't have a
big Corporation paper without a fairness
and bias analysis so
kind of there's a lot of political
pressure on these companies to uh kind
of do these kind of fairness and bias
uh determinations right usually this
boils down to like is everyone in your
data set white or not like
is every single image from the us or do
you have images from other countries as
well right so
I think it's like this is a little bit
not necessarily
doing anything to improve the actual
performance of the model sometimes I
feel like these actually these fairness
and things like that sometimes make the
model actually worse
but
I don't know
whatever
matching across images we match patch-level features between images from different domains poses and objects so
this is right the ViT with patch size 14 cuts each image into a grid of patches and then you can look at what each of those patches corresponds to between these two different images
so you can see how the eye of the
elephant matches the eye The Head and
the ear and the ear the legs and the
legs and so on
this is quite crazy the fact that it's
able to match this uh I think it's
Ganesh right this this God the elephant
god Ganesh and then the uh elephant here
cardboard car with the car with the bus
gender skin tones and age
blah blah
fairness across gender and skin tone
it's going to perform slightly better
on females than males
it's weird
and it's going to perform slightly
better on 45 to 70 year olds than on old
people
turns out the model is racist against
old people look at that 88 versus 93
gotta shut it down gotta shut it down we
can't have old people not being
segmented properly
estimating the environmental impact okay another thing that people kind of talk about in these papers now from these big companies is this environmental impact thing like oh we trained for x amount of time and it uses up this much carbon and it's like
this just seems like such nonsense to me
like to be honest like the environmental
impact of like a building is worse right
like
people make Rolexes and fancy cars and
think about how much carbon is used up
to hold a un convention where they make
this giant building and everybody shows
up and everybody goes to Davos in their
airplanes and in their Jets and they
take their private jets to Davos and
here we are being like Oh training these
models is is bad because we have too
much we're using carbon to train these
models it's like dude these models are
the future like we're going to help so
many people with these models
and you're going to try to prevent
people from training models because the
carbon footprint of the gpus is too high
like
there's so many other things that are
way more useless than this and that
people have no problem doing right like
single-use plastic single-use plastic is
so much worse for the environment than
training imagenet 1K on
gpus
like we need we need perspective here
right it's like
I don't think climate change is impacted
by Deep learning and training these
models
I think let's get rid of single-use
plastic let's get rid of pollution let's
get rid of private jets like there's so
many other things that are
much better targets for climate change
discussions than training uh large
models
okay future work and discussion
uh in this work we present DINOv2 a new series of image encoders pre-trained on large curated data with no supervision
this is the first self-supervised learning work on image data that leads to visual features that close the performance gap with weakly supervised alternatives across a wide range of benchmarks and without the need for fine-tuning
for me it's more of a let us use and research this rather than protect the environment
yeah I mean
Gustavo you're on to something right
sometimes like for example this is very
prevalent in the uh biomedical space
right in the biomedical space
in order to release a new biomedical
device or in order to release a new drug
or anything like that you need to go
through so many different uh regulatory
kind of like tests and and things and
processes that it's basically it takes
millions of dollars to even
have that which means it's impossible to
compete with large companies in the
biomedical space because they just have
so much more money that they can do that
so I could see
a similar thing happening in the AI
space right where it's like you're not
allowed to train an AI system unless you
can calculate the carbon footprint of
your AI training right and if you're
just some researcher at an academic
institution you don't you don't have the
time to calculate the carbon footprint
of your AI training so therefore you're
not allowed to train the AI system which
means that over time all of the AI is
being done by these big companies that
can do all of the regulatory
uh hoops that are required to do it so
yeah sometimes it's a little bit sinister right because if all this fairness and environmental impact stuff becomes a requirement then it will become very difficult for anybody that doesn't have the same budget as OpenAI and Meta and Google to train these
yeah it is kind of messed up in that way
a few properties emerge from this training such as an understanding of object parts and scene geometry
we expect that more of these properties will emerge at larger scales of model and data akin to instruction emergence in large language models and we plan to continue scaling along these axes
this paper also demonstrates that these
visual features are compatible with
classifiers as simple
as linear layers
meaning the underlying information is
readily available
yeah they release an open source version
I mean
I think that to me I'm I'm
actually like I'm much more bullish on
meta AI I know like meta and and
Facebook AI research they have a lot of
kind of negative association because of
just Facebook is kind of creepy you know
and like the social media is a little
bit creepy but I don't know in the past
six months like it seems like Facebook
is not afraid to release their models
and they go here and they tell you
exactly what they're doing they tell you
what they're training on they tell you
how they're training they tell you all
the different tricks
and I like that you know I like that I
like being able to read a paper and
understand and like know what is going
on behind this foundational model
and I feel like open AI which used to be
the company that called itself open AI
that was all about open sourcing they're
the ones that are secretive now right
like I don't know how they trained GPT-4 I don't know how they're training GPT-5 I don't know what dataset
they're using I don't know how they
curated that data set I don't know the
tricks that they use I don't know if
they're doing data parallelism model
parallelism I don't know which of these
different uh techniques they're using
the distillation techniques warm-up
techniques like
so I don't know I'm it's weird for me to
say this but I actually trust Facebook
now more than I trust openai which is
weird but I guess that's just
the way the universe works
and
I do have a heart out I have to go to a
meeting soon but
pretty cool paper I think that to me
this they they definitely demonstrated
that this this model is general and that
you can use these pre-trained vision and
pre-trained feature encoders for a
variety of tasks and you don't have to
fine-tune them you can basically just
use them Frozen you can use them as is
so I don't know I feel like this this is
powerful I can't wait to kind of see
what people do with these and to see how
the community improves
um but yeah cool paper I agree
uh hope you guys found that useful if
you guys have other work other papers
other GitHub repos that you that you're
interested in in me kind of trying out
and exploring definitely let me know uh
if not like And subscribe and
hope you guys have a good Wednesday
peace out