
Model Details: DPT-Large (also known as MiDaS 3.0)

Dense Prediction Transformer (DPT) model trained on 1.4 million images for monocular depth estimation. It was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in this repository. DPT uses the Vision Transformer (ViT) as backbone and adds a neck + head on top for monocular depth estimation.

This model card was written jointly by the Hugging Face team and Intel.

| Model Detail | Description |
| --- | --- |
| Model Authors - Company | Intel |
| Date | March 22, 2022 |
| Version | 1 |
| Type | Computer Vision - Monocular Depth Estimation |
| Paper or Other Resources | Vision Transformers for Dense Prediction and GitHub Repo |
| License | Apache 2.0 |
| Questions or Comments | Community Tab and Intel Developers Discord |
| Intended Use | Description |
| --- | --- |
| Primary intended uses | You can use the raw model for zero-shot monocular depth estimation. See the model hub to look for fine-tuned versions on a task that interests you. |
| Primary intended users | Anyone doing monocular depth estimation |
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |

How to use

The easiest way to use the model is via the pipeline API:

```python
from transformers import pipeline

pipe = pipeline(task="depth-estimation", model="Intel/dpt-large")
result = pipe("http://images.cocodataset.org/val2017/000000039769.jpg")
result["depth"]
```
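
As a usage note, the pipeline returns a dictionary. Assuming the standard depth-estimation pipeline output keys (`depth` for a rendered PIL image and `predicted_depth` for the raw tensor), a minimal sketch continuing from the snippet above:

```python
# Continuing from the snippet above. The key names assume the standard
# depth-estimation pipeline output format.
result["depth"].save("depth.png")       # rendered depth map (PIL.Image)
print(result["predicted_depth"].shape)  # raw prediction (torch.Tensor)
```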

If you want to implement the entire logic yourself, here is how to do zero-shot depth estimation on an image:

```python
from transformers import DPTImageProcessor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

# prepare image for the model
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
```

For more code examples, we refer to the documentation.

| Factors | Description |
| --- | --- |
| Groups | Multiple datasets compiled together |
| Instrumentation | - |
| Environment | Inference completed on Intel Xeon Platinum 8280 CPU @ 2.70GHz with 8 physical cores and an NVIDIA RTX 2080 GPU. |
| Card Prompts | Model deployment on alternate hardware and software will change model performance |
| Metrics | Description |
| --- | --- |
| Model performance measures | Zero-shot Transfer |
| Decision thresholds | - |
| Approaches to uncertainty and variability | - |
| Training and Evaluation Data | Description |
| --- | --- |
| Datasets | The dataset is called MIX 6, and contains around 1.4M images. The model was initialized with ImageNet-pretrained weights. |
| Motivation | To build a robust monocular depth prediction network |
| Preprocessing | "We resize the image such that the longer side is 384 pixels and train on random square crops of size 384. ... We perform random horizontal flips for data augmentation." See Ranftl et al. (2021) for more details. A rough sketch of this preprocessing is shown below. |

Quantitative Analyses

| Model | Training set | DIW WHDR | ETH3D AbsRel | Sintel AbsRel | KITTI δ>1.25 | NYU δ>1.25 | TUM δ>1.25 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DPT-Large | MIX 6 | 10.82 (-13.2%) | 0.089 (-31.2%) | 0.270 (-17.5%) | 8.46 (-64.6%) | 8.32 (-12.9%) | 9.97 (-30.3%) |
| DPT-Hybrid | MIX 6 | 11.06 (-11.2%) | 0.093 (-27.6%) | 0.274 (-16.2%) | 11.56 (-51.6%) | 8.69 (-9.0%) | 10.89 (-23.2%) |
| MiDaS | MIX 6 | 12.95 (+3.9%) | 0.116 (-10.5%) | 0.329 (+0.5%) | 16.08 (-32.7%) | 8.71 (-8.8%) | 12.51 (-12.5%) |
| MiDaS [30] | MIX 5 | 12.46 | 0.129 | 0.327 | 23.90 | 9.55 | 14.29 |
| Li [22] | MD [22] | 23.15 | 0.181 | 0.385 | 36.29 | 27.52 | 29.54 |
| Li [21] | MC [21] | 26.52 | 0.183 | 0.405 | 47.94 | 18.57 | 17.71 |
| Wang [40] | WS [40] | 19.09 | 0.205 | 0.390 | 31.92 | 29.57 | 20.18 |
| Xian [45] | RW [45] | 14.59 | 0.186 | 0.422 | 34.08 | 27.00 | 25.02 |
| Casser [5] | CS [8] | 32.80 | 0.235 | 0.422 | 21.15 | 39.58 | 37.18 |

Table 1. Comparison to the state of the art on monocular depth estimation. We evaluate zero-shot cross-dataset transfer according to the protocol defined in [30]. Relative performance is computed with respect to the original MiDaS model [30]. Lower is better for all metrics. (Ranftl et al., 2021)
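
For orientation, the sketch below shows a generic way to compute the two metric families in the table (AbsRel and the percentage of pixels with δ > 1.25). It is not the full evaluation protocol of [30], which additionally aligns the prediction to the ground truth before scoring; the function names and validity mask here are assumptions.

```python
# Generic per-image depth metrics (not the full protocol of [30]). Both are
# "lower is better" as reported in the table: abs_rel is the mean absolute
# relative error, and delta_outlier_rate is the percentage of pixels whose
# depth ratio exceeds the 1.25 threshold.
import numpy as np


def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


def delta_outlier_rate(pred: np.ndarray, gt: np.ndarray, thresh: float = 1.25) -> float:
    """Percentage of valid pixels with max(pred/gt, gt/pred) > thresh."""
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(100.0 * np.mean(ratio > thresh))
```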

| Ethical Considerations | Description |
| --- | --- |
| Data | The training data come from multiple image datasets compiled together. |
| Human life | The model is not intended to inform decisions central to human life or flourishing. It is an aggregated set of monocular depth image datasets. |
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | The extent of the risks involved in using the model remains unknown. |
| Use cases | - |
| Caveats and Recommendations |
| --- |
| Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. There are no additional caveats or recommendations for this model. |

BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2103-13413,
  author     = {Ren{\'{e}} Ranftl and
                Alexey Bochkovskiy and
                Vladlen Koltun},
  title      = {Vision Transformers for Dense Prediction},
  journal    = {CoRR},
  volume     = {abs/2103.13413},
  year       = {2021},
  url        = {https://arxiv.org/abs/2103.13413},
  eprinttype = {arXiv},
  eprint     = {2103.13413},
  timestamp  = {Wed, 07 Apr 2021 15:31:46 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2103-13413.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```