DINOv2
all right
welcome to another hoopo stream
uh I did a little Switcheroo yesterday
so originally we were going to read a
segmentation paper kind of a
continuation of the segment anything
but uh just last night or I don't know
if it was last night but yesterday they
released
uh meta AI research
released Dino V2
uh which seems more important than uh
the paper I was going to read so I
switched and we're now going to read
DINOv2
uh
released basically uh
this week I don't know why it's S14
because I saw it for the first time on
uh Twitter yesterday
recommended by a friend
so this is version two of uh dino which
is
an unsupervised method for training kind
of
foundational computer vision models if
you can call it that I think computer
vision is getting to the point where uh
these things can be called foundational
models right we're no longer in the
world of segmentation models and
classification models and bounding box
detection models and I think that world
of having specialized architectures for
the different parts is slowly
disappearing into the mist of time and
more and more we're seeing kind of these
big giant
uh multi-task
computer vision models that you can
apply to
a huge variety of tasks and this is the
GitHub by the way and if you look here
right the backbones that they train are
all Vision Transformers so
we're kind of seeing the supremacy of
vision Transformers right you don't
really see a lot of ConvNets as
encoders anymore
but these are Big right you got big
big Vision Transformers that are
supposed to be task agnostic
so look at this you can even load them
directly from python this is quite
quite good
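As a rough sketch of what that looks like (the model name here is taken from their README; treat the exact call as illustrative):

```python
import torch

# Load one of the released DINOv2 backbones from the repo via torch.hub
# (a smaller variant like ViT-S/14 fits comfortably on a consumer GPU).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# 224 is divisible by the patch size 14, so this gives a 16x16 grid of patches.
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = model(dummy)   # one global feature vector per image
print(features.shape)
```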
I got the conda environment
and the requirements
yeah and the problem maybe the one
negative of this is if you look at this
right these like training uh runs that
they run they're running on 12 a100s
like this is
you know like I feel like the previous
generation of computer vision models you
could kind of maybe train them at home
or at least maybe fine-tune them at home
but nowadays these things are just so
huge you can't even
run these you know
but it's good you know we're kind of
moving forward and we're seeing the
beginnings of foundational models for
computer vision
they're not the beginnings but kind of
the supremacy of them
so let's get going here so
as we see these Foundation models
become the norm in the machine learning
world this is something that you're
seeing all the time as well right
gone are the days where a paper just has
three to four names on it right nowadays
the machine learning papers have 20
names on them because
there's a whole Squad of people a whole
team of people that are required to
train these giant Foundation models so
maybe we're going to see a change in
Just the Way Machine learning research
works and rather than reading uh 10
papers that are each kind of like
largely spearheaded by one person and
there's maybe like a handful of people
on the thing
you're going to be reading machine
learning papers uh with 20 names on them
right which is kind of what we're seeing
um so let's get started here recent
breakthroughs in natural language
processing on model pre-training on
large quantities of data have opened the way for
similar Foundation models in computer
vision yeah that's kind of the key word
there
these models could greatly simplify the
use of images in any system by producing
all-purpose visual features right you
want to have an encoder that
give it an image and it'll give you a
feature Vector an embedding that is
useful for any task you could want
segmentation classification bounding box
detection potentially some kind of weird
regression
features that work across image
distributions and tasks without fine
tuning
yeah you don't want to you don't need to
push any gradients into this giant
encoder
this work shows that existing
pre-training methods especially
self-supervised methods can produce such
features if trained on enough curated
data from diverse sources so curated
data from diverse sources right
the word curated there is a little bit
interesting right that means they're
doing a lot of data cleaning a lot of
data prep and this is something we saw
out of open AI as well right where they
actually have a whole team of people
internally that clean the text Data
right so if cleaning the text data is
important I think cleaning the image
data is even more important
because the distribution of kind of
image
data is
I don't I don't know if I want to say
broader and more varied than text data
but I would I would kind of make that
statement
uh we revisit existing approaches
combine different techniques and scale
our pre-training in terms of data and
model size
so maybe one interesting thing here is
that openai is not going to tell us
about all the different tricks and
techniques that they use for
pre-training and training their large
Foundation models because they're so
afraid of competitors
that they don't want to release that but
meta Is Not Afraid right and
whatever techniques they use here uh for
training these large models on huge uh
these huge parallel setups of like
multi-gpus in these server racks
those are probably very very similar to
the techniques that openai is using to
train their llms right there's probably
the same kind of tricks so
looking at these type of papers where
they're trading these Vision Foundation
models might be a way to kind of Intuit
what openai is doing when they're
training their uh text Foundation models
so a little tip there
uh technical contributions aim at
accelerating and stabilizing the
training at scale right lots of
different types of regularization maybe
uh
evaluation kind of like weaved in there
we propose an automatic pipeline to
build a dedicated diverse and curated
image data set instead of uncurated data
as typically done in a self-supervised
literature
curation is important
in terms of the models we train a vit
model with 1 billion parameters this is
a huge Vision Transformer
this is no joke like you probably can't
even fit this on your consumer GPU and
distill it into a series of smaller
models right
so this is kind of interesting I feel
like
maybe previously what you would have
seen is that they would have trained
multiple smaller or multiple models of
different sizes but now
kind of distillation has gotten to the
point where it's pretty good and there's
a good set of Tricks associated with
model distillation model distillation is
where you take a larger model and then
you train a smaller model to basically
mimic the bigger model right you give
something to the bigger model the bigger
model produces the output and then you
say Okay small model here is the input
to the big model and the output to the
big model just copy that right
so if you're trying to save on a bunch
of training and compute budget this
actually seems like a very good
technique right train the very very big
model with your huge data set with all
your tricks all the like kind of
regularization all that stuff and then
take that big model and then from it
distill a bunch of smaller models
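A minimal sketch of that idea, not their actual recipe (their distillation reuses the full self-supervised objective), just a student copying a frozen teacher's outputs with a soft cross-entropy:

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, images, optimizer, temperature=1.0):
    """One toy distillation step: the student mimics the frozen teacher's outputs."""
    with torch.no_grad():                      # teacher is frozen, no gradients
        teacher_logits = teacher(images)
    student_logits = student(images)
    # Soft-target cross-entropy (KL divergence) between teacher and student.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```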
which is actually probably
much more common that we think it is I
think probably open AI does this as well
right I bet you that whenever you're
using ChatGPT right you're not actually
using the large chat GPT model you're
using some kind of distilled model that
has been uh kind of chosen to fit
exactly into whatever one inference GPU
that like
is designed for user requests right
because the one billion parameter model
is just going to be too huge
having that in a kind of
model serving context where people can
send requests to that model it would
be very complicated to have
that available
but these little tiny models that you
can fit on one GPU right that's a lot
easier to do
OpenCLIP
uh-oh is this the end of clip no because
this isn't text right
learning task agnostic pre-trained
representations have become the standard
in natural language processing
one can use these features as they are
IE without fine tuning
this is key here right
and Achieve performance on Downstream
tasks that are significantly better than
those produced by task specific models
the success has been fueled by
pre-training on large quantities of raw
text using pretext objectives
such as language modeling or word
vectors that require no supervision
and the word raw here is kind of
misleading right I think part of what
they're going to talk about in this
paper and I think it's going to be a
huge theme is the curation right where
you can just train on huge raw data sets
that you just scraped off the internet
but you the quality of that of those
images is just not going to be good
enough and you you want to curate that
data set so
I suspect that a huge section of this is
going to be the curation
uh following this Paradigm Shift we
expect similar Foundation models
to appear in computer vision
these models should generate visual
features that
work out of the box on any task
most promising efforts towards these
foundational models focus on text guided
pre-training
so this is what clip does right clip has
both text and image features are being
projected into the same kind of
embedding space
but it doesn't seem like that's what
they're going to be doing here I think
this is an image only model
this form of text guided pre-training
limits the information that can be
retained about the image since captions
only approximate the rich information
in images and complex pixel level
information
may not surface with this supervision
this is kind of interesting they're
saying that there's some limit
to uh text and image modalities if you
train on both text and image which is
what clip does they're saying that
you're going to lose out on some of the
signal that you're going to get right
specifically this pixel level
information
that's kind of cool right I feel like
that's also contrarian right most people
would say that hey if you have text and
image it's better to train on both
modalities and the features you get out
of that are going to be more rich but
here you have people saying that no the
text is just a distraction you're better
off just training purely on the image
we compute a PCA PCA is a principal
component analysis it's a kind of it's a
form of dimensionality reduction uh I
would call it like a classic machine
learning algorithm and it's a way to
take kind of like a high dimensional uh
vector
space I guess and compress it into
usually three dimensions so that you can
visualize it so
uh
here we go
so if you have some data set like this
right you can identify the principal
component right and the principal
components are going to be defined by
the some kind of dimension in which you
have High variance or low variance right
so here
this kind of smear of data right you can
say okay well the most variants exist
kind of along this axis and then this
axis so these two must be the the two
principal components of it right and you
can you're not limited to like two
Dimensions or three dimensions you can
kind of do it in any amount of
dimensions
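A toy version of what that figure is doing, with random numbers standing in for the per-patch features coming out of the encoder:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a (num_patches, feature_dim) array of patch features.
patch_features = np.random.randn(16 * 16, 768)

pca = PCA(n_components=3)
components = pca.fit_transform(patch_features)     # (num_patches, 3)

# Rescale each component to [0, 1] so it can be shown as a color channel.
components -= components.min(axis=0)
components /= components.max(axis=0)
rgb_image = components.reshape(16, 16, 3)           # one color per patch
```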
show the first three components each
component is matched to a different
color Channel
same parts are matched between related
images despite changes of pose style or
even objects
huh
so this is actually really cool so when
you look at this it kind of looks like
some kind of segmentation right like a
pose detector model right like these uh
pose Nets dense pose
right this is like work that people did
for a while where it's like it's
basically
segments out the head and the body and
the legs as different parts but the
interesting thing here is that this
isn't that right this is just PCA on the
features right you they fed these images
into their Giant image encoder they get
a vector of features they do PCA on
those features and then it turns out
that the representation of the elephant
head is
the same kind of thing as the Eagles
Wings right and I guess not here here
they're showing you that all four of
these elephants right which one of them
isn't even a picture of an elephant it's
like a picture of a statue of an
elephant the head of the elephant
is the same right you see this green
color on all of it so it's almost like
it's implicitly learned this notion of
different parts of the animal which is
kind of crazy actually look at
here this one's even more impressive
right look at this there's this is like
an overhead shot
of a bunch of horses on a field
and each of them is
has the same exact kind of uh coloring
as the individual pictures of horses
which is crazy right because these are
so tiny
so we can understand scale quite well as
well
that's cool
an alternative to text guided
pre-training is self-supervised learning
where features are learned from images
alone these approaches are conceptually
closer to pretext tasks such as language
modeling and can capture information at
the image and pixel level
however despite their potential to learn
all-purpose features most of these
advances in self-supervised learning
were made in the context of pre-training
on small curated data sets ImageNet-1k
right good old ImageNet-1k
right the 1k there refers to 1,000
classes it's a classification data set
some efforts on scaling these approaches
have been attempted but they focused on
uncurated data sets which typically lead
to a significant drop in quality
right uncurated significant drop in
quality
this is explained by the lack of control
over the data quality and diversity
which are essential to produce good
features
I agree with this but I also think that
this is a phase right I think that we're
currently in a phase where training on a
very very large curated data set is
better than training on a slightly
larger uncurated data set but I feel
like the
the Rich Sutton kind of bitter lesson is
going to come back and what we're going
to see is that
I feel like in the future the most the
biggest possible uncurated data set is
actually going to be the best way to do
it so I think this curating the data
sets is working right now for this
particular uh generation of foundation
models but I suspect that in the future
the scale will just beat out
and maybe there are some ways that you
could basically weight the data you could
take the image and then figure out
whether it's a high quality image or a
low quality image with a model itself
and then basically do some kind of like
pseudo-labeling in that way
we explore if self-supervised learning
has the potential to learn all-purpose
visual features if pre-trained on a large
quantity of curated data we revisit existing
discriminative self-supervised
approaches
okay
that learn features about the image and
Patch level both the image and Patch
level so these are Vision Transformers
right so they cut up the image into
these little patches
we reconsider some of the design
choices under the lens of a larger data
set most of our technical contributions
are tailored towards stabilizing and
accelerating discriminative
self-supervised learning
so basically it's all these bags of
tricks that you use whenever you're
training these huge models
so here are some numbers here two times
faster
and three times less memory
which is huge
and larger batch sizes this is key so
uh very old paper at this point but a
paper that is definitely a seminal work
in machine learning is "Don't Decay the
Learning Rate, Increase the
Batch Size"
and basically what they describe in this
paper is that
larger batch sizes are just inherently
better right because the bigger the
batch size the more kind of stable the
gradient or the kind of direction is
right like if you just have a couple
images if you have a small batch size
then the direction you take in that
gradient step that results from that
batch can be kind of noisy right and you
can end up taking kind of steps that
kind of move around for no like there's
a lot of noise in them right but if you
have a very large batch the average kind
of
the direction that you end up getting
when you take a gradient step from that
large batch is more in line with the
entire data set right so you kind of end
up taking it like a straighter line
through this loss landscape
regarding pre-training data we have
built an automatic pipeline to filter
and rebalance data sets
so this is similar to what we saw in the
segment anything paper where basically
they have this human in the loop kind of
like semi-automatic
pipeline where
kind of like initially the humans are
labeling and then the system labels more
and more and then the humans are just
kind of like uh confirming and uh
checking to make sure that the system is
labeling correctly
data similarities are used instead of
external metadata and do not
require manual annotation
a major difficulty when dealing with
images is to rebalance Concepts and
avoid overfitting on a few dominant
modes
okay interesting so you have this kind
of mode collapse potentially where
there's specific kind of solutions that
the uh
neural net can end up in that are like
kind of good for solving the thing but
are just mostly at local Minima
or a local Maxima depending on what you
want to measure
we gathered a small but diverse Corpus
of 142 million images just just a little
small data set you know just a 142
million images just casual
we provide a variety of pre-trained
visual models called Dyno V2 trained
with different Vision Transformers
architectures on our data
we released all the models
you know open AI like they don't release
and then meta here is releasing
everything so
meta is more open than open AI which is
kind of weird to think about
we validate the quality of DINOv2 on
various computer vision benchmarks so of
course you're going to see some kind of
imagenet potentially Coco
we conclude that self-supervised
pre-training alone is a good candidate
for learning transferable Frozen
features so Frozen here and transferable
the idea here is that you don't want to
uh fine-tune the feature encoder
sometimes when you take a pre-trained
encoder such as an imagenet encoder you
don't actually you you don't want to
freeze it right you you want to let some
of your gradients flow through it
because like that it becomes more
adapted to your task but we're moving
into a world where these pre-trained
encoders are just so powerful that if
you push gradients into them they're
just going to become worse so in that
Paradigm you would basically never
freeze your feature encoder
right and the feature encoder is uh this
thing here right this giant
backbone is another word that you can
use to call it
that are competitive with the best
openly available weekly supervised
models
competitive so they didn't get state of
the art they got competitive
intra image self-supervised training
okay so here you have different related
works
this paper is actually pretty long and I
do have a heart out so I'm probably
going to scroll through some of this
overview of our data processing pipeline
images from a curated and uncurated data
source are first mapped to embeddings
uncurated images are then deduplicated
and matched to the curated images
the resulting combination augments the
initial data set through self-supervised
retrieval system
okay
so they have
uh in these kind of like block diagrams
a lot of times the data right the
database is like kind of like shown as a
cylinder
so here they're showing you they have
this giant database of
uncurated data and curated data and
obviously the size here shows you that
there's much more uncurated data than
there is curated data then they take all
of these images and they feed them
through some feature encoder which is
probably just going to be a version of
the feature encoder that they're using
and it'll give you an embedding right
and you can use the similarity between
those embeddings to compare the uh
different images right so the same way
that people use Vector databases to
store text in such a way that you can
use similarity to compare different
parts of text which is super hot in the
llm space right now you can do the same
kind of thing with images right you
could you could embed all your images
and then you can use similarity between
the embedded images in order to find
similar images
which is what's going on here right
deduplication is basically just saying
hey these two images have almost
identical embeddings therefore let's get
rid of it and retrieval here is the uh
idea of like okay let's go and give me
images that are very very similar in
their embeddings to this image here and
that's what you get there
so this kind of self-supervised
retrieval system where the system can go
into the database find the images that
are most similar to the one that it
currently has and then maybe curate its
own little mini batch that it trains on
growing body of work is focused on
scaling the abilities of self-supervised
pre-training and model size
automatic data curation
okay data processing we assemble our
curated
LVD-142M data set so this is the
actual data set that they use
and again like props to them for
actually naming it you know and like
make like telling you what they're
training on it's not just like a
hey we trained on a data set but we're
not going to even tell you what it is or
how big it is or anything like that
right they do tell you how big it is and
they do tell you kind of how they got it
uh images that are close to those in
several curated data sets we describe
below the main component in our data
pipeline including the curated Dash
uncurated data sources
the image deduplication step and the
retrieval system so there's a couple
different components of here you have
the curated and uncurated sources you have
deduplication and then you have
retrieval
does not require any metadata
or text so they're not training a clip
right they're not training something
that has knowledge of both text and
image this is a pure image
Foundation model
uh is detailed in the appendix and
contains ImageNet-22k the train split of
ImageNet-1k Google Landmarks and several
fine-grained data sets so this is the
actual
components of
their LVD-142M
we collect a raw unfiltered data set
from a publicly available repository of
crawled image data
okay
we extract URL links of images
we discard URLs that are unsafe or
restricted by domains and post-process
the downloaded images PCA hash
deduplication
NSFW filtering and blurring identifiable
faces
I wonder how much they're missing out on
you know like in these type of
like how much of all the images in the
internet are like not safe for work
right
it's probably a lot there's probably a
huge chunk of
of image data that they could be training
on
right so who's going to be brave enough
to train on all the porn data
that's the real question
we apply the copy detection pipeline of
Pizzi et al. to the uncurated data
and remove near-duplicate images
so this is where they're doing the
similarity of the embeddings this
reduces redundancy and increases
diversity among images
we also remove near duplicates of images
contained in the test or validation set
of any Benchmark used in this work
a lot of filtering going on here
we build our curated
pre-training data set by retrieving
images from our uncurated data source
that are close to the images in our
curated sources
we first compute an image embedding
using a self-supervised ViT-H/16 network
pre-trained on imagenet 22k
and then use cosine similarity as a
distance measure between these two
images so
this is interesting I thought that they
would have basically the way that they
embedded these images they would have
used the model itself and then kind of
constantly updated that model to get
better and better embeddings
but it actually sounds like what they're
using to create these embeddings for the
retrieval and deduplication is actually
just a pre-trained ViT-Huge so
the H here means huge so it's a
bigger one right there's different sizes
you have ViT-Small ViT-Base ViT-Large
and then ViT-Huge
and the 16 here refers to the patch size
this particular Vision Transformer cuts
the image into 16 by 16 pixel
patches
and then uh cosine similarity is a
measure of similarity between two
vectors right
so it tells you how similar two vectors
are so two vectors that are basically
like this very high cosine similarity
two vectors that are like that pointing
in opposite directions very low cosine
similarity
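Written out, for two embedding vectors $x$ and $y$:

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert} \in [-1, 1]$$

so 1 means they point the same way, 0 means they're orthogonal, and -1 means they point in opposite directions.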
uh given a query data set if it is large
enough we retrieve N typically four
nearest neighbors for each query image
if it is small we sample M images from
the cluster corresponding to each query
image we adjust n and M by visual
inspection of the retrieval result
they probably have an internal Vector
database right like this to me screams
internal Vector database
the deduplication and retrieval stages
of our pipeline rely on the FAISS library
Facebook
AI Similarity Search something like
that
to efficiently index and compute batch
searches of its nearest embeddings
oh I remember this yeah
yeah this is like the internal Facebook
similarity search
written in C++ with complete wrappers
for Python and NumPy someone needs to
rewrite this in Rust
every time I see C++ now I'm
like
they should rewrite it in Rust but I
feel like rust developers are like
extremely hard to find because it takes
a special type of person to learn rust
uh we heavily leverage its support for
GPU accelerated indices using inverted
file indices with product quantization
codes
yeah so these are all the different
tricks you can use to basically make
similarity search faster
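A minimal sketch of that kind of FAISS usage, using an exact inner-product index on L2-normalized embeddings (which is cosine similarity); the quantized GPU indices they mention are configured differently:

```python
import numpy as np
import faiss

d = 768                                   # embedding dimension
embeddings = np.random.rand(10000, d).astype("float32")
faiss.normalize_L2(embeddings)            # so inner product == cosine similarity

index = faiss.IndexFlatIP(d)              # exact inner-product index
index.add(embeddings)

# Retrieval: for each query embedding, find its 4 nearest neighbors.
queries = embeddings[:5]
scores, neighbors = index.search(queries, 4)

# Deduplication idea: a neighbor (other than the image itself) with similarity
# above some threshold gets treated as a near-duplicate.
near_duplicates = scores[:, 1] > 0.95
```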
the whole processing is distributed in a
compute cluster of 20 nodes equipped
with like look at these monsters here
you have
20
nodes right and each node has eight V100
32 gigabyte GPUs like
holy, these are monster
systems right
and
you know I stand in awe of these kind of
systems because they're very powerful
but also it makes me a little bit sad
right because
I feel like five years ago you read a
machine learning paper and and they were
like oh we trained this on like a GPU on
our consumer hardware and it was amazing
because you're like oh that means I
can train that right but I feel like
every single paper that I read now it's
like the hardware that they're using is
just Way Beyond
everything how much do they cost each
let's see
a V100 32
gigabyte
you know this isn't as bad as the A100s
this is like a $3,000 maybe four
thousand dollar GPU
but there's eight of them right so it's
three thousand dollars
times eight which is about 24 thousand dollars
per node and then there's 20 nodes
so the total cost of that training rig
is basically half a
million dollars
so this 20 node setup with eight V100 32GB GPUs each is a half
million dollar system
we learn our features with a
discriminative self-supervised method
that can be seen as a combination of
DINO and iBOT losses with the centering
of
SwAV so here are the different
losses right and the
loss for this is probably
going to be quite complicated there's
going to be like 10 different terms to
it
right they also have regularization here
uh a high resolution training phase We
rapidly introduce each of these
approaches but more details can be found
in the related papers okay so here are
all these different tricks let's see
Image level objective
we consider the cross entropy loss
between the features extracted from a
student and a teacher Network so in a
distillation framework you have a
teacher Network which is the big one and
then you have the student which is the
small Network
a small model the small neural net
both features are coming from the class
token of a vit
obtained from different crops of the
same image okay so they're cropping the
different parts of the image and then
feeding that into a vision encoder for a
vision Transformer which then gives you
visual tokens right and visual tokens is
just another way to say an embedding for
an image
we learn the parameters of the student
and build a teacher with an exponential
moving average of its past iterates
right so
this we saw yesterday right the
exponential moving average this idea of
keeping a running average of the
model's weights over its past training steps
so the teacher is basically a smoothed
copy of
the student
right this is uh something that we that
is much more of a thing in reinforcement
learning
because of the way that reinforcement
learning works and you have to kind of
spread your model in order to gather
experience but
the same kind of like uh distributed
training requirements are resulting in
EMA being more and more popular for
everything that isn't RL
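A minimal sketch of a DINO-style EMA teacher update (the momentum value is illustrative, not taken from the paper):

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.996):
    """EMA update: the teacher's weights drift slowly toward the student's."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```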
we randomly mask some of the input
patches given to the student but not to the
teacher we then
so it's kind of like a dropout
kind of thing right so the vision
Transformer cuts the image into
patches and you're going
to basically mask out some of them right
so it's kind of like dropout
basically
we add a cross entropy loss between the
patch features of both networks on each
masked patch this loss is combined with the
image level loss
untying head weights between both
objectives we observe that tying the
weights associated with both objectives
makes the model under fit at the patch
level while overfitting at the Image
level
untying these weights resolves this
issue and improves the performance at
both scales
Okay so
underfitting at the patch level versus
overfitting at the Image level so
overfitting at the Image level means
that at the highest kind of point of
your model right your model is this kind
of like there's layers that go all the
way from the layers that are close to
the image which the low level features
which are the patches and then you go up
and up and up and up all the way to the
classification head right the model head
right so you have the patch level
part of the model which is the bottom
and then you have the head part of the
model which is the top so what they're
saying is that you can actually have
under fitting at the patch level and
overfitting at the Image level which
means that your model head is kind of
over fit to the data right
it it has already memorized the data
because it's smaller the head is smaller
but the patches especially if you have a
huge Vision Transformer there's a lot
more model capacity in there so
they're actually under fit
so this is kind of an interesting uh
uh observation there where when you have
these giant models where the bottom is
just absolutely massive and then you
have like these tiny little
classification heads right that maybe
only have a thousand uh classes at the
top you can end up in a world where your
head is over fit and then the bottom
part of your model is underfit
uh Sinkhorn-Knopp centering they
recommend to replace the teacher softmax
centering step of DINO and iBOT by the
Sinkhorn-Knopp batch normalization
I don't
know what this is but it
probably seems like it's just some kind
of normalization batch normalization
layer norms and probably a
500 IQ combination of
normalization at specific layers right
KoLeo regularizer another regularizer
here
differential entropy estimation
encourages a uniform span of the
features within a batch okay so another
type of like
uh batch norm where normally what batch
norm is saying is that the actual values
for all the kind of activations in
between layers of for the batch should
be roughly the same right you don't want
uh if you have a batch of 10 images you
don't want one of the images to have
like super high uh activations and then
everything else in the batch is
basically just zero you want to like
kind of normalize them in such a way
that all of them have a little bit of
signal right so you can have a little
bit more information coming through
so
this is kind of the same idea
given a set of n vectors
you have the loss L_koleo so this is
a fancy script L it just means it's a
loss function right so you want this to
be lower
you have negative 1 over n and a
summation from i equals 1 to n so this
just means an average an average of the
log of d_{n,i} where d_{n,i} is the minimum over j
of the distance between x_i and x_j
so if you have n vectors X1 to xn right
and there's going to be
a batch is going to be a set of vectors
right if you have a batch of 10 images
you feed them through your image encoder
you're going to get a set of or a batch
of 10 vectors and then they're saying
okay
for each vector you take the minimum
distance between it and every
other vector in the batch the
distance here right that's what these
little double bars mean it's a norm the
length of the difference vector
then the log of that sum it over all the
batch get the average that's the loss so
it's like an extra regularization term
we also L2 normalize the features before
Computing this regularizer
okay so a lot of fancy uh regularization
and normalization going on here
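Here is a small PyTorch sketch of how I read that regularizer (my own reading, not their implementation):

```python
import torch
import torch.nn.functional as F

def koleo_loss(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KoLeo-style regularizer: encourage features to spread out within the batch."""
    # x: (n, d) batch of feature vectors
    x = F.normalize(x, dim=-1)                 # L2-normalize first, as they mention
    dists = torch.cdist(x, x)                  # pairwise Euclidean distances, (n, n)
    dists.fill_diagonal_(float("inf"))         # ignore each vector's distance to itself
    d_min = dists.min(dim=1).values            # nearest-neighbor distance per sample
    return -torch.log(d_min + eps).mean()      # -1/n * sum_i log d_{n,i}
```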
uh adapting the resolution increasing
image resolution is key to pixel level
Downstream tasks such as segmentation or
detection where small objects disappear
at low resolutions
yeah this is kind of interesting because
one thing that we saw right it's in this
paper here at the very beginning is uh
this picture here of these horses like
these horses are absolutely tiny
right like look at this picture it's
like an overhead image of like 50 horses
on a Green Field and it's picking out
the individual horses right so if you
were to like down sample all these
images to like a 256 by 256 or something
you would lose all of that right you
would no longer have
the ability to like kind of pick out
tiny things in large high resolution
images so how do they do that
so the way that they do that is that
they train at high resolution which is
time and memory demanding
and instead they increase the resolution
of images to 518 by 518 during a short
period at the end of pre-training
okay so they have like a schedule here
right
or a curriculum is another word for this
right where you have training on a
specific data set at the beginning and
then you train on a different data set
afterwards right you have a curriculum
that you that you use
and here the curriculum is low
resolution images and then high
resolution images
we consider several improvements to
train the models we train models on
A100s using PyTorch 2.0
that's pretty cool they're using kind of
the bleeding edge
what did I just do I just accidentally
went all the way down
uh
the code is available along with
pre-trained models used for feature
extraction
so the pre-trained model that I assume
they used to uh get the embeddings that
they use for the similarity search
that's probably what they mean by this
one
with the same Hardware compared to the
iBot implementation the dyno V2 code
runs around two times faster
and only one third of the memory
fast and memory efficient
attention yeah so
one of the biggest problems with
Transformers is because they basically
multiply every vector by every other
Vector like the length of your sequence
is actually determining the size of the
overall memory right so if you have very
small sequence your memory footprint is
going to be small but as soon as you
increase the the sequence length right
the memory grows quadratically with that
so
Transformers are very
memory hungry right there's some
techniques that people have come up with
to reduce the amount of memory that
Transformers take but it's still it
could still be pretty bad so let's see
what they uh do here we Implement our
own version of flash attention to
improve memory usage and speed on the
self-attention layers
our version is on par with or better
than the original on all cases
considered while covering more use cases
and Hardware
due to the GPU Hardware specifics the
efficiency is best when the embedding
Dimension per head is a multiple of 64.
yeah
so this is an important thing to
consider is that
sometimes people don't realize it but a
lot of the hyper parameters for the
model architecture are not even chosen
because they result in better
performance they're chosen because
they're specific to the hardware that
they're trained on right
so Google models are going to be
the hyper parameters for the model
architecture of a model that's trained
at Google is going to be specific to the
size of the TPU that it's being trained
on right
the
model parameter or the model hyper
parameters for the model architecture
that uh Facebook trains are going to be
specific to their a100 gpus so
these Dimensions right the size of the
model head the size of the uh
embeddings that you have within your
vision Transformer all of those are
going to be specific to the hardware
uh matrix operations are even better when the full
embedding dimension is a multiple of 256
as a consequence our vitg architecture
slightly differs from the architecture
proposed in order to maximize compute
efficiency
we use an embedding dimension of 1536
with 24 heads rather than 1408 with 16
heads
yeah so
bigger model
but also chosen so that works better
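The arithmetic there (assuming the alternative they mean is the standard 1408-dim, 16-head ViT-g):

$$\frac{1536}{24} = 64 \quad\text{vs.}\quad \frac{1408}{16} = 88, \qquad 1536 = 6 \times 256$$

so the per-head dimension is the multiple of 64 the hardware likes, and the full embedding dimension is a multiple of 256.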
and this is like a this paper is booby
trapped with
with uh
references so I can't click
fourteen thousand
our ViT-g backbone counts 1.1 billion
parameters which is quite big
nested tensors and self-attention our
version also allows running in the same
forward pass the global crop and the
local crop
that have different numbers of patch
tokens
leading to significant compute
efficiency gains compared to
using separate forward and backward
passes as done in prior implementations
Okay so
basically
how do I describe this
I don't know let's not describe this
efficient stochastic depths
we Implement an improved version of
stochastic depth that skips the
computation of the dropped residuals
rather than masking the result
so
in these Transformers sometimes you mask
specific parts and also they said that
they were dropping out specific patches
in the Transformer so
whenever you have Dropout in your model
in plain PyTorch a lot of times it's
actually still getting calculated and
then it's just getting zeroed out right
so you're actually spending some amount
of compute calculating uh activations
which are just going to get dropped out
so you could probably save by not having
to calculate those
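A rough sketch of plain stochastic depth that really skips the branch instead of computing it and zeroing it out; their version is a fancier batched variant that shuffles the batch and slices a fraction of the samples:

```python
import torch
import torch.nn as nn

class StochasticDepthResidual(nn.Module):
    """Residual block that, when dropped, never evaluates the expensive branch."""
    def __init__(self, branch: nn.Module, drop_rate: float = 0.4):
        super().__init__()
        self.branch = branch
        self.drop_rate = drop_rate

    def forward(self, x):
        # Toy version: drops the branch for the whole batch at once during training.
        if self.training and torch.rand(()).item() < self.drop_rate:
            return x                                   # branch is never computed
        out = self.branch(x)
        if self.training:
            out = out / (1.0 - self.drop_rate)         # rescale to keep expectation equal
        return x + out
```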
hope you're doing well it will be great
if you could try to do live
implementations from scratch by
referring to the architecture in the
paper
yeah I mean I can try to but I think
it's important to realize that the
implementations of papers
is kind of fading away right
it used to be
possible to implement uh machine
learning papers because they were
trained on similar hardware and they
were made by basically one or two people
and a six-month kind of research project
but
this is not that right this is a model
trained on million dollar systems uh
created by teams of 20 people so it's
basically impossible to re-implement
these right you're not going to
re-implement this paper
what you can do is you can take this uh
Vision Transformer and use it in your
own uh technique right you can you can
go and you can download this exact
Vision Transformer and you can use it
for some kind of cool new interesting
thing
or app or or task that you have I think
that's definitely doable you can do that
as a single person but as a single
person it's basically impossible to
re-implement this paper because you
don't
there's just not enough time right
you're not a 20-person team you're not
going to have a half million dollars to
spend on gpus
uh this saves memory and compute in
proportion approximately equal to the
drop rate thanks to specific fused
kernels
uh fused kernel so obviously whenever
you create a deep learning uh model
it gets compiled into these Cuda kernels
which are what actually is running on
your GPU and those Cuda kernels like
joining them together or fusing them as
it's called is
one of the best ways to get better
efficiency and uh
speed or use less compute basically so
that's another
requirement that is driving the model
architecture where we saw we know how
the model architecture here they were
describing how
it's
they're choosing the dimensions of these
things based on what fits inside the
gpus and then not only that but then
also the specific ordering of these
kernels is also chosen because
they want specific
operations specific ops to be close
together so that they can fuse them
right so
the hardware is driving the model
architecture development
with high drop rates this allows the
drastic improvement in compute
efficiency and memory
this implementation consists of
randomly shuffling the B samples over the
batch dimension and slicing the first
(1 - d) x B samples for the computations in
the block
so basically if you have a very high
drop rate
then you can save a lot on compute
fully sharded data parallel
so data parallelism is a form of
distributed training right you have
model parallelism and you have data
parallelism one kind of way to think
about it is that in model parallelism
you have your model split across
multiple devices in data parallelism you
have your batches or your data split
across multiple devices in in practice
usually there's a combination of both
there's both data parallelism and model
parallelism
minimizing our objective with the AdamW
optimizer requires four model replicas
in float32 precision
so there's four versions of the model
in float32
and this is interesting here so
I would have thought that they would
have done this in mixed Precision so
right when you're training models
every single parameter in that model is
taking 32
bits of storage right so a float32
takes up 32 bits something like a float16
takes half as much memory because it
only takes 16 and then something like a
uint8 takes eight and then something
like a 4-bit type takes four right so there's like
basically you can keep halving the memory
by reducing the precision
so mixed uh or mixed Precision training
is something that's popular
and I'm curious as to whether they did
that but let's see
uh so this sums up to 16 gigabytes of
memory for a billion parameter model
okay that's kind of cool
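Rough arithmetic for where that 16 GB comes from, assuming the four float32 replicas are the weights, the gradients, and the two AdamW moment buffers:

$$10^{9}\ \text{params} \times 4\ \text{bytes} \times 4\ \text{replicas} = 16\ \text{GB}$$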
so I could actually fit that on my 3090
in order to reduce this memory footprint
per GPU we split the model replicas
across GPUs
sharding 16 gigabytes across gpus using
the pytorch implementation of
FSDP which is just fully sharded data
parallel consequently the model size is
not bounded by the memory of a single
GPU but by the total sum of the GPU
memory across compute nodes yeah
so this is more model parallelism
the pytorch implementation of fsdp
brings a second Advantage which is to
save on Cross GPU communication costs so
this is another kind of theme where more
and more uh the GPU is no longer the
limiting factor in these training
problems right usually it's not the fact
that your GPU can't Matrix multiply fast
enough which used to be the case
nowadays the uh limiting reagent is
actually that the GPU is sitting there
idle waiting for information to be sent
to a different GPU or come back from
the
host machine right so
the communication between these gpus is
actually now the limiting factor which
is why you're seeing uh the rise of kind
of these
Advanced kind of like Hardware
interconnects that like uh I think the
best example of this is the Tesla Dojo
chip right
where
they basically put these like right next
to each other in such a way like this
yeah so that the communication between
these is a lot faster
right
because if you look at like a server
rack right now the the data has to go
through a pcie slot into the memory and
then back into
another thing right so like the
communication is starting to become part
of it so
at this point people are starting to do
more crazy things like these uh compute
planes compute mesh right where the gpus
are like right next to each other so
that they can very quickly communicate
and you're no longer limited by that
uh the weight shards are stored in float32
precision as required but broadcasting
weights and reducing gradients is done
in float16 precision okay so
they are doing some kind of mixed
Precision stuff here right
MLP head gradients are reduced in float
32 to avoid training instabilities this leads to
approximately a 50% reduction in
communication costs compared to the
float32 gradient all-reduce operations
used in distributed data parallel
which is used in other self-supervised
pre-training methods
as a consequence the training procedure
scales more efficiently than DDP
with float16 autocast when scaling the
number of GPU nodes
overall PyTorch FSDP mixed precision is
superior
to DDP with autocast
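A rough sketch of what that looks like with the PyTorch FSDP API; a real run needs a multi-GPU process group and a proper wrapping policy, so treat this as illustrative:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Communicate in float16 (weight broadcast, gradient reduction) while the
# sharded master weights stay in float32, roughly what the paper describes.
fp16_policy = MixedPrecision(
    param_dtype=torch.float16,    # dtype used for compute / broadcast weights
    reduce_dtype=torch.float16,   # dtype used for gradient all-reduce
    buffer_dtype=torch.float16,
)

# After torch.distributed.init_process_group(...) and building the model:
# model = FSDP(model, mixed_precision=fp16_policy)
```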
very cool and you know what's even
cooler is that all of this is available
right
I'm telling you like low-key I feel like
uh meta's
meta is the open AI company meta is much
more open they release their tools they
talk about their tools they talk about
the techniques
like that's what I want to see you know
I want to see that I like this building
out in the open
much more commendable than uh
opening eyes building in secret
most of our technical improvements to
the training loop aim at improving the
training of large models over large
quantities of data for smaller models we
distill them from our largest model
instead of training them from scratch
yeah this is huge like it seems like
such an easy thing to intuit
like hey rather than training four
different size models from scratch why
don't we just train one really huge
model and then just distill the smaller
ones from the bigger one
that does seem like
I'm like wow why didn't people think of
that before
since our objective function is a form
of distillation from the teacher Network
to the student Network we leverage the
same training loop with a few exceptions
we use a larger model as a frozen
teacher
keep a spare EMA of the student that we
use as our final model so this is the
exponential moving average of the
student they probably have multiple
students that are being trained in
parallel and then they basically average
them together to have the uh
student that they end up publishing
look at that nice little ablation study
look at that so they tell you here are
all the different uh
techniques that they described for
training these big models and then here
is all the improvements that you get
so
layer scale stochastic depth teacher
momentum tweak warm-up schedules batch
size what's the biggest one here is this
there you go dude that's or actually I
guess this
or reproduction I don't know what that
means
making the batch size big is so
important
that's something that like I feel like I
learned time and time again it
just seems to be the most important part
it's like the bigger your batch size the
more stable your training is and the
better the final solution that you get
to
which is unfortunate because like as an
independent researcher as someone who
kind of only has uh like consumer gpus
you can't train on these giant batch
sizes you need these kind of distributed
systems distributed kind of like
multiple nodes with like hundreds of
gpus in order to have these massive
batch sizes
performance as in our experiments the
linear probe performance is lower
bounded by the k-NN performance
some modifications like layer scale and
high stochastic depth rate 0.4
incur a decrease in linear probe performance but have
the benefit of increasing the stability
by avoiding NaN loss values
these modifications allowed for the next
set of improvements to be added
we present a set of ablations to
empirically validate different
components of our pipeline the technical
modifications the pre-training data and
the impact of model distillation
we consider various Downstream tasks
that are described in section 7.
okay so this is kind of a description of
all the different ablation studies so
basically that table but they're going
to go through all the different parts
here
our approach improves over the iBOT method
by combining it with several existing
components described in section four to
evaluate their importance we train
multiple models where we successively
add components to the Baseline iBot
model
so we report the top one accuracy so top
one accuracy is the hardest accuracy so
top five is basically as long as
your model as long as the answer to the
actual classification problem is within
the top five responses with the highest
confidence then you count that as a
success top one accuracy is much more
stringent it means that you have to the
highest confidence class has to be the
right answer
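A quick sketch of the difference in code:

```python
import torch

def top_k_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose true class is among the k highest-confidence predictions."""
    top_k = logits.topk(k, dim=-1).indices            # (batch, k) predicted class ids
    hits = (top_k == labels.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# top-1 is stricter than top-5: the single most confident class must be correct.
logits = torch.randn(8, 1000)        # e.g. ImageNet-1k logits for a batch of 8
labels = torch.randint(0, 1000, (8,))
print(top_k_accuracy(logits, labels, k=1), top_k_accuracy(logits, labels, k=5))
```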
generally we observe that each component
improves the performance on either K N
or linear probing only layer scale and
stochastic depth blah blah okay
quality of the features is directly
related to the quality of the
pre-training data that's a
obvious statements 101 right there
we randomly sample 142 million images
from the same data source
we train a ViT-g/14 on each data set for
the same number of iterations and
include a variant of imagenet 22k
the most Salient observation is that
training on a curated set of images
works better on most benchmarks than
training on uncurated data
I mean are they keeping the number of
images the same or are they
yeah so this is the problem is that
they're comparing a 142 million curated
image data set to 142 million uncurated
images so if the size of the data set is
the same of course the curated data is
going to be better right but if you were
to say 142 million curated images
compared to 300 million uncurated images
now I don't know if you would get the
same result it might be the case that
the uncurated images because they're
just bigger would be better
what are your thoughts on cloud gpus
Cloud gpus are
very useful I guess it's like kind of
the way to go like if I was at a startup
I wouldn't buy gpus and train things
locally I would use the cloud the
problem is that cloud gpus can get very
expensive very quickly so unless you
have a bunch of VC money to to burn the
cloud gpus are generally you know I'm
saying prohibitively expensive for
individual people
like if you want to mess around on
your old GPU at home you know that's not
that expensive but if you want to mess
around on like a100s
pretty soon you're going to end up with
a couple hundred dollars of AWS bills
you know so cloud gpus is pretty much
the way to go it's just expensive so you
have to be a startup or a company or
maybe an academic institution that has
kind of a budget
when compared with models trained on
ImageNet-22k training is also superior on
all the benchmarks
okay what do we've got here ablation of
Open Source training data we compare the
ImageNet-22k that was used
okay so here you have different training
data sets you have their 142 million
curated you have the 142 million
uncurated
and you can actually see here the
difference it's only it's only a slight
difference or actually it's a big
difference here look at that
59 compared to 73 for the uncurated
versus curated
and imagenet 22k here
it's actually very similar look at that
so I mean what I'm learning from this is
that the ImageNet-22k data set is
roughly equivalent to the LVD-142M so
how many images are in ImageNet-22k
let me look up the
ImageNet-22k data set
size
how many images does this have
I think I actually see it here so you
see imagenet 22k has 14 million images
an lvd 142 mil has 142 million images so
this is kind of interesting here that
this data set which has 10 times less
data right this is 14 million images is
getting slightly better performance on
ImageNet-1k than the LVD-142M
I mean obviously it's a more specific
data set so it makes sense that the
imagenet data set is going to be like
kind of better for imagenet but
still crazy that 10 times more data
doesn't give you a huge performance
boost
model size and data
we quantify the importance of scaling
data with model size as the model size
grow training becomes more beneficial
than training on imagenet 22k
yeah so the bigger models if you train a
big giant model on a small data set it's
just going to overfit hard so you can
only train these big models if you have
big data sets
the two go together
a ViT-g trained on LVD-142M matches the
performance on imagenet 1K
we validated the proposed technical
improvements by adding them
incrementally
this section analyzes the performance
hit observed if we ablate specific loss
terms starting from our best performing
model
we ablate the importance of the KoLeo
loss and the impact of the masked image
modeling term
uh ADE20k so here are a couple different
benchmarks that they're going to use to
compare Table 3a shows the impact
of using the KoLeo loss
model scale versus data scale so here on
the x-axis you have the uh sizes of the
vision Transformers that they're used
right so L
then huge and then g is the biggest one
so these are the smaller and the bigger
and then on the X or Y axis here I guess
this is probably the performance the top
one performance or something like that
so you can see here that
uh the bigger models right are able to
more effectively use the big data set so
this is because ImageNet-22k is a
smaller data set it's like 10 times smaller than
LVD-142M so you can see that when
you have a smaller model right a ViT-L
which is still a pretty big model
the smaller model doesn't use the big
data set as effectively as the bigger
model
if you give the big model and the big
data set you get better performance but
small model big data set doesn't do as
well
which is kind of what they're showing
you here
uh for small architectures we distill
larger models instead of training them
from scratch
we use the distillation procedure
described in section five
uh we evaluate the effectiveness of this
approach by comparing a ViT-L/14 trained
from scratch with one distilled from a
ViT-g/14
okay this is pretty cool so
all right so we were talking about
distillation as a method of basically
having smaller versions of the model
right you train one giant model
from scratch and then you distill the
smaller models
but is that going to be the same
performance as a small model trained
from scratch
and that's what they're showing you here
is
the ViT-L trained from scratch is this
blue and this is the score that it gets
on all these different categories I kind
of like this this weird table here right
so each of these is a benchmark so cars
food imagenet Kitty which is a kind of a
autonomous vehicle data set and so on
let's zoom in here
so you can see that actually when you
train it from scratch it's worse on
everything like the distilled model is
actually better on everything and and
here's the even crazier thing the
distilled model is better than the
teacher model
right like what, that's weird to
think about you take a big model you
train it from scratch you use the big
model to train a smaller model right
the smaller model is just distilled from the
bigger model and it turns out that the
smaller model trained on the bigger
model
is better at Oxford-H and Paris-H
I think that's kind of weird
right not intuitive
we show that a ViT-L model distilled from
a frozen ViT-g outperforms the same model
and sometimes even outperforms the
distillation Target
we measure the impact of changing the
resolution during pre-training on the
performance image of image and Patch
level features
we consider models trained from scratch
using a fixed resolution of 224
or 416 by 416. so these are the two
different sizes that they train at
uh we resume for 10K more iterations at
416 so they're doing this kind of like
curriculum like alternating training on
uh larger images and smaller images
we report the performance of a linear
probe evaluated at various resolutions
the model train on high resolution
images performs the best across
resolutions but it comes at a high cost
training at 416 pixels by 416 pixels
takes three times more compute than
training at 224.
so there's this kind of trade-off of
like ideally we train
on
bitter lesson seems to kill low budget
academic AI simply scale up everything
yeah that's true
uh how does distilling differ from
transfer learning so transfer learning
is taking a model that has already been
trained and then using it for a new task
so in transfer learning you're still
using the original model but when you're
distilling you're you're training a
separate smaller model right so
distillation is taking a big model and
then using it to train a small model
transfer learning is using a model and
then training it on a new task
this is
training on high resolution for only 10K
iterations at the end of the training is
almost as good and only requiring a
fraction of the compute
in this section we present the empirical
evaluation of our models on many image
understanding tasks
we evaluate both Global and local image
representations on category and instance
level recognition
so these are a couple different uh
computer vision tasks right you have
instance level recognition semantic
segmentation monocular depth estimation
or prediction and then action
recognition so these are a lot of different types of tasks right action recognition is kind of like classification and a lot of times this is like a pose detection task right so you're kind of detecting key points on a human or a hand or something like that
monocular depth estimation is taking a single camera image and then giving you the depth image from that
so it's like a pixel level task because
you have to identify the depth at every
single Pixel
instance level recognition that's more
of like a bounding box task so it's not
pixel level it's just giving me the
bounding box of a specific uh instance
right or like
object within the image and then
semantic segmentation is also pixel
level because it's like you have to tell
me what the class of every single Pixel
in that image is so you have
a big smear of different tasks here you have pixel-level tasks
and you have more high-level things such as detection and action recognition
so we train linear classifiers on top of
the Frozen features linear classifiers
is just a fancy way of saying like a
little tiny model head
right so if you had ImageNet-1K the linear classifier is basically you're taking that giant pre-trained feature encoder and they freeze it so they don't let any gradients get into it
and then you just put a little tiny head on top and that little tiny head has 1000 outputs and each of those 1000 outputs represents one of the ImageNet classes
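as a concrete sketch of what that looks like, assuming the torch.hub entry point published in the DINOv2 repo and a 1024-dimensional ViT-L/14 feature (both worth double checking against the repo, this is illustrative only):

import torch
import torch.nn as nn

# linear probe sketch: frozen backbone plus a 1000-way linear head for ImageNet-1K
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                    # no gradients flow into the encoder

head = nn.Linear(1024, 1000)                   # ViT-L/14 features assumed to be 1024-d

def probe(images):
    with torch.no_grad():
        feats = backbone(images)               # frozen features
    return head(feats)                         # only this head gets trained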
for a short duration and gets results close to the full training okay so what are we looking at here we're looking at the image resolution so this is 224 by 224 and then 768 by 768 so these are low resolution images and high resolution images on the x-axis
and then on the y-axis you have a mean IoU intersection over union which is basically how much your predicted region overlaps the ground truth it's used for bounding boxes and for segmentation masks
so here's a low IoU and here's a high IoU right
so like how much does the region that you predicted overlap the true ground truth region
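if you want the exact definition, IoU is just intersection area divided by union area, and the same ratio works whether you compute it over boxes or over segmentation masks. a tiny sketch with boxes given as (x1, y1, x2, y2):

def iou(box_a, box_b):
    # intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

iou((0, 0, 10, 10), (5, 5, 15, 15))            # 25 / 175, roughly 0.14, a low overlap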
right so higher is better and then
higher is better as well on accuracy so
accuracy is this is a classification
task so accuracy is basically did you
get it correct
and you can see here that the low
resolution
goes down so if you're training your
model at a low resolution it does not
perform well at high resolution
if you train your model at high
resolution it does perform well at high
resolution but then this blue line here
is this kind of curriculum technique
that they just described where they
train in a low resolution and then they
train at a high resolution so they like
kind of do this two-part curriculum and
they show you that okay well actually
that works pretty much
as good as the high resolution
right it's still better to train at the
high resolution if you if you wanted to
but it's going to be so much more
compute heavy that you're actually
better off doing this kind of like
curriculum technique where they train it
at a low resolution and then a high
resolution and it performs
quite about the same
second we show that they match or surpass the performance of weakly supervised ones on a substantial number of tasks
in our comparisons we use two kinds of models as baselines we compare against the best performing self-supervised models that are openly available
we run our evaluations for MAE DINO SEER MSN EsViT and iBot these are just basically a bunch of methods that they're comparing to
several architectural variants were proposed we report results for the one that leads to the best top-1 accuracy
we report performance of open source weakly supervised models such as CLIP
okay so they're of course going to
compare to clip
because clip is like extremely popular
okay imagenet classification
as a first evaluation we probe the
quality of the holistic image
representation
produced by the model
so what does that mean what is the
quality of an image feature right
this is
this is a fundamental problem in machine
learning in general right the quality of
your embeddings the quality of your
features
and
right now kind of the gold standard is
basically to have a nice varied set of
benchmarks right and this is not just a
problem in computer vision it's a
problem in uh natural language as well
or any kind of image modality right
how do you determine the quality of the
features of a giant llm right
well you have a variety of benchmarks
and then you evaluate its performance on
all those benchmarks
so that's kind of the same thing that
you're going to do here for the uh
computer vision model is like okay well
are these features good are they better
than that feature or that feature right
it's like
it's impossible to know as a human because what even is a 1000-dimensional vector
you basically can't understand what that even means so how can you judge the quality of it so
right now the way that people do it is
they basically just create these like
benchmarks and they have like as big of
a set of benchmarks as they can and then
the model that performs the best on all
the benchmarks has the best features but
I suspect that over time we will start
to learn more and more about like what
those features actually mean maybe
better techniques for understanding
maybe clustering the features like
I suspect that will develop kind of a
whole science around feature
understanding and kind of like
that will become the new way to
determine feature quality rather than
what we're doing now which is basically
just uh using benchmarks in order to
kind of as a proxy for the feature
quality
because most SSL methods use ImageNet validation performance we also report top-1 accuracy on ImageNet
we run the evaluation with our code and we compare our frozen features to the best publicly available SSL features regardless of architecture or pre-training data
we also see that the performance
increase on alternative test sets is
larger for our method indicating
stronger generalization so
again this we don't really know how to
measure generalization well
other than to just basically evaluate
the model on a variety of different
tests and then see if it performs well
across all of them
we also want to validate that our features are competitive with state-of-the-art open source weakly supervised models
so open clip
Eva clip
let's see let's see how you perform this
is a magic number right here so we got
clip
with a ViT-L
and then these are the different benchmarks here and you get 79
you have EVA-CLIP with a ViT-g so this is a bigger CLIP
83
DINOv2
with the ViT-g the bigger one 83
okay so it's it's not better but it's
competitive I see what they're saying
how does it compare to dino vits I mean
this is a smaller one so it's not a fair
comparison
78.
right a small ViT with patch size 8 you get 78
a small ViT with patch size 14 you get 79 so this is maybe a little bit unsettling it tells you that DINOv2 is not necessarily that much better than DINO v1
really it's just bigger
right
and this here this 14 336 I believe the 336 refers to the input resolution so it's the same ViT-L/14 just run on 336 by 336 images instead of 224 so the plain ViT-L/14 is a slightly smaller setup than the ViT-L/14 at 336
can we fine-tune the encoders
we question whether the ability of our models to produce high quality frozen features impacts their performance when fine-tuned with supervision on a specific dataset
yeah this is important
because I myself tried to do this right
when I
I was messing with the segment anything
model and I was using the pre-trained
feature encoder that they had
and I had
I was trying two different things I was
like okay well if I freeze this
pre-trained feature encoder and then try
to do this segmentation task is it
better or is it actually worse than if I
don't freeze it and let some of the
gradients flow through it
and
intuitively if you if you have been
doing this generally the advice up until
now is that yes letting some gradients
go through it is better than just
freezing it right
but
while this is not core to this work this experiment is indicative of whether we have involuntarily specialized blah blah we apply the fine-tuning pipeline without tweaking hyperparameters we show that the top-1 accuracy on the validation set improves by more than two percent
here you go
but that's when the backbone is fine-tuned so
it seems like with fine-tuning you can still get a tiny bit of extra performance right
and I don't know I feel like this is
going to go away right
I feel like over time these uh feature
encoders right these pre-trained uh
models like this right these pre-trained
backbones I think we're going to get to a point where you actually don't want to fine-tune them because they're already so good they're already so specific
and so general right they can kind of work on everything and the features that they use are kind of fragile because they're giant and the learning rates that they use are very small and they're in a local minimum that is so deep
that if you try to fine-tune them you basically just mess them up so
it's interesting to see that fine-tuning
these uh
feature encoders is still you're still
going to get a little bit of performance
for your specific task but I do feel
like this is kind of eventually going to
go away
fine tuning is optional
yeah that's that's the future is fine
tuning is optional
to complement our study and probe the
generalization of our features we
evaluate our imagenet 1K models trained
with linear classification heads
on domain generalization benchmarks
okay so these are benchmarks specifically chosen to be kind of very wide with lots of different weird looking images in order to determine whether your model is overfit to ImageNet or not
uh they keep using SSL here that just means self-supervised learning they just shortened it to SSL
supervised fine-tuning on imagenet 1K
is it ready to be used I didn't see any
docs on how to use the model to input an
image and get the retrieval against a
given set of images so
yeah you can do this so Gustavo if you took this model right here
and then created a python script that fed every single one of your images through this pre-trained encoder
you're going to get a vector for each image and then you store all of those vectors in a vector database such as this one here faiss right
then you can take any new image encode it and then find all the images that are similar to it so you can implement retrieval pretty easily if you want
right
the key is that you can download this model right here you don't even need to download it manually you can just load it with python
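a minimal sketch of that retrieval pipeline, assuming the torch.hub entry point from the DINOv2 repo, faiss installed, and random tensors standing in for your own preprocessed images:

import faiss
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

def embed(image_batch):                        # image_batch: (N, 3, 224, 224) tensor
    with torch.no_grad():
        feats = model(image_batch)
    feats = torch.nn.functional.normalize(feats, dim=-1)   # unit norm for cosine search
    return feats.cpu().numpy().astype("float32")

gallery_images = torch.randn(16, 3, 224, 224)  # stand-in for your image collection
query_images = torch.randn(2, 3, 224, 224)     # stand-in for new query images

index = faiss.IndexFlatIP(1024)                # inner product on unit vectors == cosine
index.add(embed(gallery_images))
scores, ids = index.search(embed(query_images), 5)   # top-5 most similar gallery images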
so if you combine this with this
you can do retrieval based on similarity
we could actually probably do that that
actually probably seems like a stream
that I could do if you guys want to do
that
if you guys are interested in that uh
join the Discord and then uh
comment on that and we could totally do
that as a stream
okay let's get back to it
uh ViT-g
supervised fine-tuning
so here they're showing you fine-tuning at slightly different resolutions
and you can see that the slightly larger resolution image benefits a little bit more from the fine-tuning
incorporating prompting is eventually
better than fine tuning
yeah uh
Fee nugen Van I'm sorry if I didn't
pronounce your name right but I actually
think that that's what I envisioned I
think right here this paper is obviously
they trained it with no text this is a
pure image Foundation model right it's
not like clip clip is a image and text
Foundation model and part of the reason
they did that is because the data sets
for image and text are not as good right
but I think eventually what's going to
happen is that you're going to basically
use a kind of like Auto labeling
technique right you're going to take
images you're going to use clip to
create a
uh uh caption for that image and then
you're gonna train a text and image
model on the captioned images so I think
over time the quality and the
availability of image Text data is
actually going to improve because we
have models such as clip that can
understand it so I see this kind of giant self-supervised pipeline
where you're captioning images and then using that to train a model and then using the better model to caption more images and so on and you have this kind of flywheel that keeps labeling and keeps captioning
and then over time it actually gets
better because
I do think that intuitively it seems
like
having additional modalities right
having both image and text will result
in a better uh feature space than just
images by themselves but
we're at the point now where scale is
still King and if you can have a bigger
data set by just using only images the
features that come out of that are going
to be better
how powerful this will be to be used as
an image labeling tool
is this I mean this is what you want here so Gustavo this table here table four
you see here
CLIP is about 79 EVA-CLIP is about 83 DINOv2 79 so it's not going to be significantly better than what you have access to already but it's kind of on par
right I think if you use Dyno V2 if you
use this mod this encoder and then just
basically
fine tune or not fine-tune but like use
it for your own uh task you're probably
going to get about the same uh
performance as you were if you were to
use Eva clip or any of these other kind
of large
foundational Vision models
so it's not a step function it's not
like we uh
we suddenly unlocked a new capability it
just seems like
uh competitive with the current models
domain generalization with a linear probe see this is more interesting so frozen features so what they did here is they freeze the features they say okay I'm not going to fine-tune this I'm going to freeze the encoder and then I'm going to see how well it performs on these benchmarks here ImageNet-A ImageNet-R and so on
and OpenCLIP is actually very fragile you see that OpenCLIP if you freeze it and you don't push gradients into it it actually doesn't perform very well at all right it doesn't have the ability to generalize
but look at DINOv2
75 that's much better that means that the frozen features of DINOv2 are actually much more general than the frozen features of these other models here OpenCLIP DINO v1 MAE and so on
so I don't know I would still use this
if I was doing a computer vision problem
I feel like this is the feature encoder
I would use today
additional image and video
classification okay so they're using
this for video now
we studied the generalization of our
features on Downstream classification
benchmarks
we consider two sets of evaluations in that context on one hand we use large and fine-grained datasets such as iNaturalist and Places205 okay so
I think these are also classification tasks
you train with a linear classifier and data augmentations
our model significantly outperforms OpenCLIP
you know this is that table right there
we measure the performance of our model
on video action recognition
right so this is basically like YouTube
videos where it's like someone cooking
and like someone riding a bicycle and
things like that and it's basically a
classification task but you have a kind
of a sequence of frames right so you're
performing classification with a bunch
of images rather than a single image
we pick eight evenly spaced frames so
it's not very dense yet right it's not
like you're looking at a two hour video
you're looking at eight frames of a
video so
video is still primitive
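for reference, picking eight evenly spaced frames is usually just an index computation like this (a sketch, not their exact loader):

import numpy as np

def evenly_spaced_frames(num_frames_in_clip, num_samples=8):
    # indices spread uniformly from the first frame to the last
    return np.linspace(0, num_frames_in_clip - 1, num_samples).round().astype(int)

evenly_spaced_frames(300)                      # array([  0,  43,  85, 128, 171, 214, 256, 299])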
we see that amongst the self-supervised
approaches our model clearly sets a new
state of the art
okay so they're saying it clearly outperforms on SSv2 and SSv2 is a more complicated dataset I haven't really heard about this what is it Something-Something V2
so it seems like
it's like kind of
these are still pretty low resolution
but it's like egocentric
video of someone like grabbing things
putting something on a Surface moving
something up covering something with
something putting something into
something
that's kind of a cool data set I've
never heard of this
but okay it's like kind of like an
egocentric data set of like opening a
jar and putting something in it so that
is kind of interesting
ssv2 requires a much richer
understanding of the video frames yeah
it's you can't just like classify based
on texture right
Jesus Christ
uh we compare selected frozen features on 12 transfer classification benchmarks these benchmarks cover scenes objects textures blah blah blah our model outperforms other self-supervised learning models
okay so here you have image
classification video classification a
couple different versions here's the
ssv2 that we just looked at open clip
very bad performance here
actually no it's not that much worse than DINOv2 I guess SSv2 is just a very hard dataset that's a hard dataset look at that 35 percent is the state of the art on that
that's good you know you want benchmarks
where everyone performs poorly right you
want a benchmark that's very very
difficult like
these benchmarks here where like
everyone's scoring in the high 90s
those aren't good benchmarks because
basically once you get to like like 90
95 getting that last extra percent
is not a measure of how good your model
is it's like a measure of how overfit
your model is so this is why I think
that as our models get better and better
over time the benchmarks need to get
harder and harder over time so
I feel like imagenet 1K
is not a good data set anymore or is not
a good Benchmark anymore because it's
like every single score is high like for example this here C10 I think that means CIFAR-10
this is borderline meaningless like what does 98.7 versus 99.5 mean right it just means it got one more image correct basically or like a couple more images correct and
the reason it got those correct is
probably not necessarily a good reason
so these data sets here flowers like
look at these 99.99 like
I think we need to start getting rid of
some of these benchmarks that are too
easy now
right this one much better Benchmark
you're still at 80 63 87 right you can
still actually tell what's a better
model than the other but like
these data sets that are just way too
easy these benchmarks too easy
even though these benchmarks favor text
guided pre-training our features are
still competitive with open clip on most
classification benchmarks
instance recognition is a different problem now
on the task of instance-level recognition we use a non-parametric approach
images are ranked according to their cosine similarity with a query image we evaluated our model and compared to baselines on Paris and Oxford
which are landmark recognition benchmarks
we also evaluate on Met a dataset of artworks from the Metropolitan Museum
okay a couple different instance recognition benchmarks here
we probe the quality of patch level
features so
patch level features versus just
features right so normally when they say
feature is what they're referring to is
what comes out at the end of the
pre-trained uh encoder right so you have
your vision Transformer
you have your image your image gets fed into your vision Transformer and then out of that you get a feature right
and that's the feature vector that they're normally referring to when they say features
when they say patch-level features what they're referring to is the features for the little individual patches of the vision Transformer right the vision Transformer cuts the image up into a grid and each little grid cell gets fed in as its own token
so you can look at the features that come out at the end of the vision Transformer or you can look at the features for each individual little patch that the vision Transformer is using right
so that's what they mean here by probing the quality of the patch-level features as opposed to just the features at the very top
uh instance level recognition so this is
a different Benchmark right so up here
we were looking at uh image and video
classification this is now instance
level recognition
and we can see how
again you're seeing
Dyno V2 kind of beating out
all the other models
and then semantic segmentation
a different kind of task again this is
semantic segmentation is I mean you guys
probably know what segmentation is
but just in case you don't
uh which is this right it's like
basically individual labeling for each
individual pixel
hey can you suggest which models will work well on the Kvasir dataset
uh for classification purposes what is the Kvasir dataset
Kvasir dataset
multi-class image data set for computer
so it's like
this
oh he's a kind of gross dude oh
lifted polyps
oh my god dude this is nasty
ulcerative colitis
okay
um first of all God bless you for uh
performing medical segmentation like
some of some of those medical image data
set tasks are like nasty like
the
the ones for like skin diseases like
like you know what I'm saying like I've
seen some on those like skin
disease data sets
uh which models will work well so
it seems like it's kind of a classification dataset so here you go man I mean for classification DINOv2 ViT-g/14
there you go so right here this is the one you want
ViT-g/14 import torch torch.hub.load
uh take this model
uh take your dataset whatever this was here
uh encode every single one of these
images with this pre-trained Frozen
encoder
and then train a linear classifier on
top of that
that's a good start I'm not telling you
that that's going to give you the best
possible performance there's probably
all kinds of extra tricks that you can
use to make a better performance for
this thing but that's probably a good
start is taking this pre-trained feature
encoder encoding all your images and
then training a classifier on top of
that
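something like this sketch, with the hub name taken from the DINOv2 repo and scikit-learn for the linear classifier; the random tensors and the 8-class label count are placeholders for your actual Kvasir images and labels:

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14").eval()

train_images = torch.randn(32, 3, 224, 224)      # stand-in for preprocessed images
train_labels = np.random.randint(0, 8, size=32)  # stand-in labels, e.g. 8 classes

with torch.no_grad():
    features = encoder(train_images).cpu().numpy()   # frozen features, no fine-tuning

clf = LogisticRegression(max_iter=1000).fit(features, train_labels)
predictions = clf.predict(features)              # use held-out features here in practice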
semantic segmentation for our semantic segmentation evaluation we consider two different setups
linear a linear layer is trained to predict class logits from the patch tokens
this is used to produce a low resolution logit map
okay so logits are basically the output of a classification head before it is put into a softmax and turned into a confidence
and generally your cross entropy is actually computed with the logits because of the way the kernels work out it's numerically more stable and a bit faster to do it that way
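a quick illustration: PyTorch's cross_entropy takes the raw logits directly and does the log-softmax and negative log-likelihood together, which is the numerically safer way to do it

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0, 0.5]])      # raw head outputs, before any softmax
target = torch.tensor([0])                     # index of the correct class

loss = F.cross_entropy(logits, target)         # log-softmax + NLL computed together
probs = logits.softmax(dim=-1)                 # only needed if you want confidences to report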
this procedure is simple but cannot
easily produce high resolution
segmentations
we report the performance of our model variants as well as the baselines on three datasets
interestingly our evaluation is on par with fully fine-tuning an UPerNet decoder
so this is kind of interesting they're
doing like a little bit
so like
these pre-trained encoders right they
give you a feature vector and the
feature Vector is very easy to use for a
classification task because all you got
to do is just put like a little
classifier head on top of that but once you have a more complicated task right like a segmentation task then maybe you want a little bit more complicated of a head and that's kind of what they're talking about here different ideas right like a decoder or a logit map right like making a little 32 by 32 logit map so
there's still some technique still some artistry that you can use to design what you do with those features basically how are you going to take those features and use them for your task
all right so here we go these are the benchmarks here KITTI is an autonomous vehicle dataset
and I guess here lower is better so
we see that it's outperforming CLIP but not by much
depth estimation this is another uh
kind of dense task similar to
segmentation where you have to predict
every single Pixel
we consider three different setups we extract the last layer of the frozen Transformer and concatenate the class token to each patch token
we then bilinearly upsample the tokens to increase the resolution and finally we train a simple linear layer using a classification loss by dividing the depth prediction range into 256 uniformly distributed bins so
when you're doing depth right the depth
image is going to basically be the same
exact resolution as the colored image
that you input except every single Pixel
is going to be a number that represents
how far away that pixel is from the
camera right
and because the depth image is usually a
uint 8 Single Channel grayscale image
right in a uint 8 you have 256 possible
values right like 0 to 255. so
they turn it into a classification task
of like hey rather than trying to
regress the exact depth number as a
float instead of that turn it into a
classification task and for each pixel
I'm trying to classify each pixel into
one of 256 different classes so it turns
a depth estimation task into a semantic
segmentation task with 256 classes
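here is roughly what that quantization looks like as a sketch; the 0.1 to 10 metre range is my assumption, the paper just says the depth range is divided into 256 uniformly distributed bins

import torch

NUM_BINS, D_MIN, D_MAX = 256, 0.1, 10.0        # depth range in metres is an assumption

def depth_to_class(depth_map):                 # depth_map: (H, W) float tensor in metres
    t = (depth_map.clamp(D_MIN, D_MAX) - D_MIN) / (D_MAX - D_MIN)
    return (t * (NUM_BINS - 1)).round().long() # one of 256 class labels per pixel

def class_to_depth(class_map):                 # map a predicted bin back to a depth value
    return class_map.float() / (NUM_BINS - 1) * (D_MAX - D_MIN) + D_MIN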
they also do a setup where they concatenate the tokens from several intermediate layers three six and so on so this is kind of almost like a U-Net situation right in a U-Net you are not just taking the output of the encoder and then using that to do your decoding
but you're also taking intermediate results from the encoder and feeding those in kind of like a skip connection or residual connection kind of idea so they're doing the same thing here they're taking the outputs from these intermediate layers and then also using those in the decoder
uh interesting to see that iBot features outperform the ones from OpenAI CLIP
our model with the DPT matches or
exceeds the performance of recent work
okay qualitative results we show some qualitative results from our dense prediction evaluations
the linear segmentation produces good results and behaves much better under this evaluation setup
the qualitative results on depth
estimation clearly illustrate the
quantitative gap between openai clip and
Dyno V2
much smoother depth estimation
all right let's actually see these
pictures
okay so we got CLIP
so this is the input image this is CLIP and this is DINOv2
uh
and you can see here
how
clip has all this weird like artifacts
here
but Dino V2 seems to be a lot cleaner
this is the image this is uh Dyno V2 and
then this is clip
the dyno V2 seems better this is the
depth estimation so again things that
are purple are very are far away from
the camera and then things that are kind
of this like light yellow orange color
are supposed to be closer to the camera
so you can see here how the model trained using the frozen CLIP feature encoder has this kind of noise the snow as it's sometimes called
and DINOv2 is much much cleaner much smoother here
yeah kind of significantly better
out of distribution examples so
out of distribution just means that
there's some distribution of images
within your data set and there's some
distribution of images that your model
has been trained on out of distribution
means there's it's an image that's
outside of that distribution it's an
image that's weird it's like unusual in
some weird way so what does that mean
that means like a drawing right so a
drawing of a room this almost looks like
a van Gogh drawing but like a drawing of
a room is
out of distribution
for a data set where everything you
trained on was real pictures of rooms
and what they're showing you here is
like look at how their model the the
model that they trained with the dyno V2
Frozen feature encoder is actually still
able to do monocular depth estimation
and semantic segmentation on this out of
distribution example relatively well and
here we have the same kind of thing it's
like a painting right this is like oil
painting completely different kind of
textures and and
and patterns here than an image but you
can still get a pretty good depth image
and this is a little bit more not as
good but still pretty good
this is a very complicated image you
have like people laying on top of other
people
and still gets it
uh we show a few examples of applying
the depth prediction and segmentation
linear classifier to out of distribution
examples in figure eight
the qualitative results support our
claim that our features transfer between
domains
the quality of the depth estimation and segmentation predictions for pictures of animals or paintings is very good even though the domains are very different
PCA of patch features okay so now they're going to be doing principal component analysis not on the final features but on the features at the patch level right so deeper in the vision Transformer looking at what's actually happening at the individual patches
uh we show the results of the principal component analysis performed on the patch features extracted
we keep only the patches with a positive value after we threshold the first component
this procedure turns out to separate the image's main object from the background
Okay so
basically you're getting
this is like emergent right it's
emergent that the model learns to
separate the background in the
foreground right
I'm not sure that those data sets are
actually out of distribution because
that sketch is also present in high
frequency of natural images
should be a better test on image
modality like ultrasound yeah I agree I
think I agree with you that
you can call these out of distribution
but like there's degrees of out of
distribution right like this this
painting of people is probably right
here if you have your your data
distribution is probably right here
versus like the X-ray image
like imagine if someone took an x-ray of
a rock underwater right like the X-ray
of the rock underwater is going to be
like here it's like way out of
distribution so I agree with you that a
better example of out of distribution
images would be like x-rays or like
sonar or like some kind of weird image
modality that doesn't look anything like
a natural image versus like these oil
paintings uh sketches are not as out of
distribution as they could be so I agree
with you there
foreign
okay back to this we compute a second PCA on the remaining patches across three images depicting the same category
we color the three first components with three different colors and present the results
yeah so these are the images we saw at the beginning again kind of super interesting that there's this emergent behavior of separating the foreground and the background and of course it kind of makes sense that it would do that
but more visualization we compute the
PCA between the patches each component
corresponds to a different color channel
so three components right PCA
you can separate your data into any
number of Dimensions but usually it's
separated into three components for
visualization purposes right and here
they're separating into r g and B so R
represents
the first component G probably the
second component blue probably the third
component
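roughly, the visualization is doing something like this sketch; the random tensor stands in for real patch features and torch.pca_lowrank is just one convenient way to get the top components

import torch

patch_feats = torch.randn(256, 1024)           # stand-in for (num_patches, feature_dim)

centered = patch_feats - patch_feats.mean(dim=0)
U, S, V = torch.pca_lowrank(centered, q=3)     # top-3 principal directions
components = centered @ V[:, :3]               # (num_patches, 3) projection scores

# rescale each component to [0, 1] so the three columns can be shown as R, G and B
rgb = (components - components.min(dim=0).values) / (
    components.max(dim=0).values - components.min(dim=0).values
)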
and what they're showing you here is
that the components for different images
end up having
kind of similar semantic meanings how
like you see how the blue tends to
represent the legs of these animals
right the green tends to represent the
head of the animal right
so not only does it have this emergent ability to separate foreground and background but it also has this emergent ability to separate the head of something from the legs of something and the body of something so it's like it's
kind of learning the underlying kind of
patterns of our reality
emergently which is pretty cool
the first component corresponds to a specific color
delineating the boundary of the main object and the other components correspond to parts of objects
and match as well for images of the same category
this is an emergent property our model was not trained to parse parts of objects right
Foundation models
have emergent intelligence
we explore what type of information our
patch level features contained by
matching them across images
we start by detecting the foreground
using the procedure described above then
we compute the euclidean distance
between patch features
so I wonder why they're doing this right
why not cosine similarity between patch
features
why euclidean distance
then we apply a non-maximum suppression
to keep only the Salient ones
we observe that the features seem to
capture information about semantic
regions that serve similar purposes for
instance the wing of a plane matches the
wing of a bird
we also observe that the model is robust
to style and large variation of poses
yeah this is the most impressive to me like this overhead shot it's not a satellite image it's more like a drone shot but like
this is crazy
the fact that it can recognize that all of these are horses and that all these horses have heads and that all the heads match the head of this horse here
like what that's crazy
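here is a rough sketch of that matching step between two images using euclidean distance as the paper describes; keeping only the closest pairs stands in for their non-maximum suppression, and the random tensors stand in for real foreground patch features

import torch

feats_a = torch.randn(200, 1024)               # foreground patch features of image A (stand-in)
feats_b = torch.randn(220, 1024)               # foreground patch features of image B (stand-in)

dists = torch.cdist(feats_a, feats_b)          # (200, 220) pairwise euclidean distances
nearest_in_b = dists.argmin(dim=1)             # best matching patch in B for each patch in A
match_dist = dists.gather(1, nearest_in_b[:, None]).squeeze(1)

keep = match_dist.argsort()[:50]               # keep the 50 strongest (closest) matches
pairs = list(zip(keep.tolist(), nearest_in_b[keep].tolist()))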
all right and you wouldn't have a
big Corporation paper without a fairness
and bias analysis so
kind of there's a lot of political
pressure on these companies to uh kind
of do these kind of fairness and bias
uh determinations right usually this
boils down to like is everyone in your
data set white or not like
is every single image from the us or do
you have images from other countries as
well right so
I think it's like this is a little bit
not necessarily
doing anything to improve the actual
performance of the model sometimes I
feel like these actually these fairness
and things like that sometimes make the
model actually worse
but
I don't know
whatever
matching across images we match patch-level features between images from different domains poses and objects so
this is right the ViT with patch size 14 cuts each image into a grid of patches and then you can look at what each of those patches corresponds to between these two different images
so you can see how the eye of the
elephant matches the eye The Head and
the ear and the ear the legs and the
legs and so on
this is quite crazy the fact that it's
able to match this uh I think it's
Ganesh right this this God the elephant
god Ganesh and then the uh elephant here
cardboard car with the car with the bus
gender skin tones and age
blah blah
fairness across gender and skin tone
it's going to perform slightly better
on females than males
it's weird
and it's going to perform slightly
better on 45 to 70 year olds than on old
people
turns out the model is racist against
old people look at that 88 versus 93
gotta shut it down gotta shut it down we
can't have old people not being
segmented properly
estimating the environmental impact okay another thing that people kind of talk about in these papers now from these big companies is this environmental impact thing like oh we trained for x amount of time and it uses up this much carbon and it's like
this just seems like such nonsense to me
like to be honest like the environmental
impact of like a building is worse right
like
people make Rolexes and fancy cars and
think about how much carbon is used up
to hold a un convention where they make
this giant building and everybody shows
up and everybody goes to Davos in their
airplanes and in their Jets and they
take their private jets to Davos and
here we are being like Oh training these
models is is bad because we have too
much we're using carbon to train these
models it's like dude these models are
the future like we're going to help so
many people with these models
and you're going to try to prevent
people from training models because the
carbon footprint of the gpus is too high
like
there's so many other things that are
way more useless than this and that
people have no problem doing right like
single-use plastic single-use plastic is
so much worse for the environment than
training imagenet 1K on
gpus
like we need we need perspective here
right it's like
I don't think climate change is impacted
by Deep learning and training these
models
I think let's get rid of single-use
plastic let's get rid of pollution let's
get rid of private jets like there's so
many other things that are
much better targets for climate change
discussions than training uh large
models
okay future work and discussion
uh in this work we present DINOv2 a new series of image encoders pre-trained on large curated data with no supervision
this is the first self-supervised learning work on image data that leads to visual features that close the performance gap with weakly supervised alternatives across a wide range of benchmarks and without the need for fine-tuning
for me it's more of a let us use and research this rather than protect the environment
yeah I mean
Gustavo you're on to something right
sometimes like for example this is very
prevalent in the uh biomedical space
right in the biomedical space
in order to release a new biomedical
device or in order to release a new drug
or anything like that you need to go
through so many different uh regulatory
kind of like tests and and things and
processes that it's basically it takes
millions of dollars to even
have that which means it's impossible to
compete with large companies in the
biomedical space because they just have
so much more money that they can do that
so I could see
a similar thing happening in the AI
space right where it's like you're not
allowed to train an AI system unless you
can calculate the carbon footprint of
your AI training right and if you're
just some researcher at an academic
institution you don't you don't have the
time to calculate the carbon footprint
of your AI training so therefore you're
not allowed to train the AI system which
means that over time all of the AI is
being done by these big companies that
can do all of the regulatory
uh hoops that are required to do it so
yeah sometimes it's a little bit sinister right because if all this fairness and environmental impact stuff becomes a requirement then it will become very difficult for anybody that doesn't have the same budget as OpenAI and Meta and Google to train these
yeah it is kind of messed up in that way
a few properties emerge from this training such as an understanding of object parts and scene geometry
we expect that more of these properties will emerge at larger scales of model and data akin to instruction emergence in large language models and we plan to continue scaling along these axes
this paper also demonstrates that these
visual features are compatible with
classifiers as simple
as linear layers
meaning the underlying information is
readily available
yeah they release an open source version
I mean
I think that to me I'm I'm
actually like I'm much more bullish on
meta AI I know like meta and and
Facebook AI research they have a lot of
kind of negative association because of
just Facebook is kind of creepy you know
and like the social media is a little
bit creepy but I don't know in the past
six months like it seems like Facebook
is not afraid to release their models
and they go here and they tell you
exactly what they're doing they tell you
what they're training on they tell you
how they're training they tell you all
the different tricks
and I like that you know I like that I
like being able to read a paper and
understand and like know what is going
on behind this foundational model
and I feel like open AI which used to be
the company that called itself open AI
that was all about open sourcing they're
the ones that are secretive now right
like I don't know how they trained GPT-4 I don't know how they're training GPT-5 I don't know what dataset
they're using I don't know how they
curated that data set I don't know the
tricks that they use I don't know if
they're doing data parallelism model
parallelism I don't know which of these
different uh techniques they're using
the distillation techniques warm-up
techniques like
so I don't know I'm it's weird for me to
say this but I actually trust Facebook
now more than I trust openai which is
weird but I guess that's just
the way the universe works
and
I do have a heart out I have to go to a
meeting soon but
pretty cool paper I think that to me
this they they definitely demonstrated
that this this model is general and that
you can use these pre-trained vision and
pre-trained feature encoders for a
variety of tasks and you don't have to
fine-tune them you can basically just
use them Frozen you can use them as is
so I don't know I feel like this this is
powerful I can't wait to kind of see
what people do with these and to see how
the community improves
um but yeah cool paper I agree
uh hope you guys found that useful if
you guys have other work other papers
other GitHub repos that you that you're
interested in in me kind of trying out
and exploring definitely let me know uh
if not like And subscribe and
hope you guys have a good Wednesday
peace out