{"id":33873,"date":"2025-09-02T08:45:15","date_gmt":"2025-09-02T06:45:15","guid":{"rendered":"https:\/\/www.codemotion.com\/magazine\/?p=33873"},"modified":"2025-09-12T12:32:05","modified_gmt":"2025-09-12T10:32:05","slug":"vision-transformers-y-segment-anything-de-las-convoluciones-a-la-atencion","status":"publish","type":"post","link":"https:\/\/www.codemotion.com\/magazine\/es\/inteligencia-artificial\/vision-transformers-y-segment-anything-de-las-convoluciones-a-la-atencion\/","title":{"rendered":"Visi\u00f3n Transformers y Segment Anything: De las Convoluciones a la\u00a0Atenci\u00f3n"},"content":{"rendered":"\n<p>La visi\u00f3n artificial ha avanzado de manera exponencial con el advenimiento de los Vision Transformers (ViT) y el Segment Anything Model (SAM) de Meta. Estos modelos no solo est\u00e1n redefiniendo c\u00f3mo entendemos la segmentaci\u00f3n y la clasificaci\u00f3n de im\u00e1genes, sino que tambi\u00e9n est\u00e1n marcando un cambio crucial respecto a las tradicionales redes convolucionales (CNN). En este art\u00edculo, exploramos c\u00f3mo ViT y SAM, junto con el modelo DETR para detecci\u00f3n de objetos, est\u00e1n revolucionando el panorama de la visi\u00f3n por computadora. <\/p>\n\n\n\n<p>Durante casi una d\u00e9cada, las Redes Neuronales Convolucionales (CNNs) fueron las reinas indiscutibles del <em>Deep Learning<\/em> en visi\u00f3n por computador. Desde el reconocimiento de im\u00e1genes en ImageNet hasta la detecci\u00f3n de objetos y la segmentaci\u00f3n sem\u00e1ntica, las CNNs demostraron una capacidad sin precedentes para aprender jerarqu\u00edas de caracter\u00edsticas visuales. Su estructura, inspirada en el c\u00f3rtex visual de los mam\u00edferos, utilizaba filtros convolucionales para extraer patrones locales, construyendo una representaci\u00f3n robusta del mundo visual.<\/p>\n\n\n\n<p>Sin embargo, a pesar de sus \u00e9xitos, las CNNs ten\u00edan una limitaci\u00f3n inherente: su enfoque local y secuencial. Los filtros solo \u201cve\u00edan\u201d una peque\u00f1a porci\u00f3n de la imagen a la vez. Para capturar el contexto global, se requer\u00eda una arquitectura muy profunda, lo que a menudo resultaba en una mayor complejidad computacional y la p\u00e9rdida de informaci\u00f3n de gran escala.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\" id=\"h-el-giro-del-transformer\"><strong>El Giro del Transformer<\/strong><\/h2>\n\n\n\n<p>En 2017, la publicaci\u00f3n del <em>paper<\/em> \u201c<a href=\"https:\/\/research.google\/pubs\/attention-is-all-you-need\/\" rel=\"noreferrer noopener\" target=\"_blank\"><em>Attention Is All You Need<\/em><\/a>\u201d por un equipo de <strong><em>Google Brain<\/em><\/strong> cambi\u00f3 para siempre el panorama del <em>Deep Learning<\/em>. Introdujeron el concepto de <em>Transformers<\/em>, una arquitectura revolucionaria basada en el mecanismo de auto-atenci\u00f3n. Originalmente dise\u00f1ados para el procesamiento del lenguaje natural (NLP), los <em>Transformers<\/em> demostraron una capacidad asombrosa para modelar dependencias a largo plazo entre las palabras de una frase.<\/p>\n\n\n\n<p>La idea era simple pero poderosa: en lugar de procesar la informaci\u00f3n de forma secuencial, un Transformer pod\u00eda \u201catender\u201d a todas las partes de la entrada simult\u00e1neamente, asignando un peso de importancia a cada una de ellas. 
<h3><strong>Why go beyond CNNs?</strong></h3>

<p><a href="https://www.codemotion.com/magazine/es/inteligencia-artificial/redes-neuronales-convolucionales-el-superpoder-de-la-vision-artificial/">CNNs</a> capture local patterns with kernels, but they have limitations:</p>

<ul>
<li>A fixed context range, with complexity that grows as layers are stacked.</li>
<li>Difficulty modeling global relationships across the image without adding parameters.</li>
<li>Rigidity when scaling to different resolutions and domains.</li>
</ul>

<h2><strong>The Great Leap: Vision Transformers (ViT)</strong></h2>

<p>The question was inevitable: could an architecture this successful in language be applied to images? In 2020, a group of Google researchers made it happen with the paper "<a href="https://research.google/pubs/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale/" rel="noreferrer noopener" target="_blank"><em>An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale</em></a>". Thus <strong>Vision Transformers (ViT)</strong> were born.</p>

<p>The key to ViT's success was treating an image as a sequence of "patches" or "tokens". Just as a sentence is split into words, an image is split into small sub-images (for example, 16×16 pixels each). Each patch is flattened and projected into a high-dimensional embedding space, much like word embeddings in NLP. From there, the Transformer's self-attention mechanism is applied, letting every patch "see" and understand the context of every other patch in the image.</p>
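<p>A minimal sketch of that "image as words" step (illustrative, not the actual ViT implementation): cut the image into 16×16 patches, flatten each one, and project it to an embedding, exactly like a word embedding.</p>

<pre><code>import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)  # one RGB image
patch = 16

# (1, 3, 224, 224) -> (1, 3, 14, 14, 16, 16): slide a 16x16 window with stride 16
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
# flatten each patch into one vector: (1, 196, 768)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)

to_embedding = nn.Linear(3 * patch * patch, 768)  # the "word embedding" of each patch
tokens = to_embedding(patches)
print(tokens.shape)  # torch.Size([1, 196, 768]) -> a "sentence" of 196 patch tokens</code></pre>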
<h3><strong>Why was ViT so disruptive?</strong></h3>

<ul>
<li><strong>Global context:</strong> unlike CNNs, ViT models long-range relationships from the very first layer, with no need for a deep stack of convolutions. This lets it understand the whole scene and the relationships between objects more holistically.</li>
<li><strong>Scalability:</strong> ViT showed that, with enough training data and the right scale, it could outperform CNN-based models on image classification; adding layers and attention heads keeps improving performance.</li>
<li><strong>Transfer learning:</strong> pre-trained on vast datasets such as JFT-300M (over 300 million images), ViT learns extremely rich, generalizable visual representations that transfer to specific tasks with minimal <em>fine-tuning</em>, and it adapts to varied tasks without drastic changes to the backbone.</li>
</ul>

<figure><img src="https://cdn-images-1.medium.com/max/800/1*FCS64bpV49YY5M4SwazYZg.png" alt="Example of segmentation with Meta's Segment Anything Model"/></figure>

<h4><strong><em>Example: Fine-tuning ViT with PyTorch</em></strong></h4>

<pre><code>import torch
from torchvision import transforms, datasets
from timm import create_model
from torch import nn, optim
import os

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = create_model('vit_base_patch16_224', pretrained=True, num_classes=10).to(device)

# Data loaders
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # normalization these pretrained ViT weights expect
])
train_ds = datasets.CIFAR10('data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)

# Optimizer and loss
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Training
epochs = 5  # increase if you like
model.train()
for epoch in range(epochs):
    running_loss = 0.0
    for batch_idx, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        if (batch_idx + 1) % 100 == 0:
            print(f"[Epoch {epoch+1}/{epochs}] Batch {batch_idx+1} - Loss: {loss.item():.4f}")

    print(f"===> End of epoch {epoch+1}, average loss: {running_loss / len(train_loader):.4f}")

# Save the trained model
os.makedirs("checkpoints", exist_ok=True)
torch.save(model.state_dict(), "checkpoints/vit_cifar10.pth")
print("Model saved to 'checkpoints/vit_cifar10.pth'")</code></pre>

<p><strong><em>What does this code do?</em> It fine-tunes a pretrained ViT (Vision Transformer)</strong> to recognize images from the <strong>CIFAR-10</strong> dataset, which contains small (32×32) photos of airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks, resized here to the 224×224 input the model expects. At the end, it <strong>saves the trained model</strong>.</p>
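<p>Once saved, that checkpoint can be reloaded for inference. A minimal sketch, assuming the checkpoint produced by the training script above (the image file name is a placeholder):</p>

<pre><code>import torch
from timm import create_model
from torchvision import transforms
from PIL import Image

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = create_model('vit_base_patch16_224', pretrained=False, num_classes=10).to(device)
model.load_state_dict(torch.load("checkpoints/vit_cifar10.pth", map_location=device))
model.eval()

classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']  # CIFAR-10 labels
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # same preprocessing as training
])

# "some_image.jpg" is a placeholder path
img = transform(Image.open("some_image.jpg").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    pred = model(img).argmax(dim=-1).item()
print(f"Predicted class: {classes[pred]}")</code></pre>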
<h2><strong>Segment Anything Model (SAM) from Meta</strong></h2>

<p>While ViT was transforming image classification, another industry giant, Meta AI, was working on a similar revolution for image segmentation. The result was the <strong>Segment Anything Model (SAM)</strong>, a striking demonstration of the power of <em>prompts</em> and large-scale learning.</p>

<p>Traditionally, segmentation required training models to recognize and outline specific object classes (dogs, cats, trees, and so on). SAM broke that paradigm. Instead of predicting classes, SAM was designed to segment any object in an image based on interactive <em>prompts</em>. These prompts can be points, boxes, or even text descriptions.</p>

<h3><strong>How does SAM work?</strong></h3>

<p>It is trained on a massive dataset of more than a billion segmentation masks, generated automatically from 11 million images. SAM's architecture consists of three main components:</p>

<ul>
<li><strong>Image Encoder:</strong> a pre-trained Vision Transformer (ViT) that extracts the visual features of the image.</li>
<li><strong>Prompt Encoder:</strong> a module that processes the input <em>prompts</em>, whether points, boxes or text.</li>
<li><strong>Mask Decoder:</strong> a module that, from the image features and the prompt embeddings, dynamically generates the segmentation mask of the requested object.</li>
</ul>
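<p>This three-part design is what makes SAM interactive in practice: the heavy ViT image encoder runs once per image, and the light prompt encoder and mask decoder can then answer many prompts almost instantly. A minimal sketch of that pattern with the segment-anything library, assuming the <code>sam_vit_h.pth</code> checkpoint downloaded in the full example below (the image name is a placeholder):</p>

<pre><code>import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# assumes the checkpoint downloaded in the example below
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("your_image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # heavy step: the ViT image encoder runs ONCE here

# each additional prompt reuses the cached embedding, so it is nearly instant
for point in [(100, 150), (300, 200)]:
    masks, scores, _ = predictor.predict(
        point_coords=np.array([point]),
        point_labels=np.array([1]),  # 1 = foreground, 0 = background
        multimask_output=False,
    )
    print(point, "->", masks.shape, "score:", float(scores[0]))</code></pre>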
<h3><strong>SAM use cases</strong></h3>

<ul>
<li>Interactive segmentation in image editing.</li>
<li>Rapid analysis in medicine, industry and robotics.</li>
<li>Preprocessing to train specialized models.</li>
</ul>

<p>SAM's impact is immense. By decoupling segmentation from the need to recognize predefined classes, SAM democratizes the task: a single model can now serve countless applications, from medical analysis to digital content creation, simply by supplying an interactive <em>prompt</em>.</p>

<figure><img src="https://cdn-images-1.medium.com/max/800/1*mgt4GzoPfG80Wsh1uIhWZg.png" alt="Comparison between CNNs and Vision Transformers in computer vision"/></figure>

<h4><strong><em>Example: Segmentation with SAM in Python</em></strong></h4>

<pre><code>!pip install git+https://github.com/facebookresearch/segment-anything.git
!pip install opencv-python matplotlib

!git clone https://github.com/facebookresearch/segment-anything.git
%cd segment-anything
!wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -O sam_vit_h.pth</code></pre>

<p><strong><em>Installing dependencies:</em></strong> the first step (these are notebook commands, e.g. for Google Colab).</p>

<pre><code>from google.colab import files
uploaded = files.upload()

from segment_anything import sam_model_registry, SamPredictor
import cv2
import numpy as np
import matplotlib.pyplot as plt
class=\"hljs-keyword\">as<\/span> np\nimport matplotlib.pyplot <span class=\"hljs-keyword\">as<\/span> plt\n<span class=\"hljs-comment\"># Cargar modelo SAM<\/span>\nsam = sam_model_registry&#91;<span class=\"hljs-string\">\"vit_h\"<\/span>](checkpoint=<span class=\"hljs-string\">\"sam_vit_h.pth\"<\/span>)\npredictor = SamPredictor(sam)\n<span class=\"hljs-comment\"># Cargar la imagen<\/span>\nimage = cv2.imread(<span class=\"hljs-string\">\"nombre_de_tu_imagen.jpg\"<\/span>)\nimage = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  <span class=\"hljs-comment\"># Convertir a RGB<\/span>\npredictor.set_image(image)\n<span class=\"hljs-comment\"># Definir punto para segmentar (x, y)<\/span>\ninput_point = np.<span class=\"hljs-keyword\">array<\/span>(&#91;&#91;<span class=\"hljs-number\">100<\/span>, <span class=\"hljs-number\">150<\/span>]])\ninput_label = np.<span class=\"hljs-keyword\">array<\/span>(&#91;<span class=\"hljs-number\">1<\/span>])  <span class=\"hljs-comment\"># 1 para objeto, 0 para fondo<\/span>\n<span class=\"hljs-comment\"># Hacer predicci\u00f3n<\/span>\nmasks, scores, logits = predictor.predict(\n    point_coords=input_point,\n    point_labels=input_label,\n    multimask_output=<span class=\"hljs-keyword\">True<\/span>,\n)\n<span class=\"hljs-comment\"># Mostrar m\u00e1scaras<\/span>\n<span class=\"hljs-keyword\">for<\/span> i, mask in enumerate(masks):\n    plt.figure()\n    plt.title(f<span class=\"hljs-string\">\"M\u00e1scara {i+1} - Score: {scores&#91;i]:.2f}\"<\/span>)\n    plt.imshow(image)\n    plt.imshow(mask, alpha=<span class=\"hljs-number\">0.5<\/span>)\n    plt.axis(<span class=\"hljs-string\">'off'<\/span>)\n    plt.show()<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-3\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">PHP<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">php<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p><strong><em>\u00bfQu\u00e9 est\u00e1 haciendo?: <\/em><\/strong>Define <strong>un punto de entrada<\/strong> (<code>[100,150]<\/code>) en la imagen, luego el modelo genera <strong>una o m\u00e1s m\u00e1scaras<\/strong> que representan lo que cree que hay en ese punto y visualizas esas m\u00e1scaras encima de la imagen. <\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\" id=\"h-detr-deteccion-de-objetos-con-transformers\"><strong>DETR: Detecci\u00f3n de Objetos con Transformers<\/strong><\/h2>\n\n\n\n<p>La influencia de los Transformers no se detuvo en la clasificaci\u00f3n y la segmentaci\u00f3n. El <em>paper<\/em> \u201c<a href=\"https:\/\/research.facebook.com\/publications\/end-to-end-object-detection-with-transformers\/\" rel=\"noreferrer noopener\" target=\"_blank\"><em>End-to-End Object Detection with Transformers<\/em><\/a>\u201d (DETR) de Facebook AI Research (ahora Meta AI) reimagin\u00f3 por completo la detecci\u00f3n de objetos.<\/p>\n\n\n\n<p>Los modelos tradicionales de detecci\u00f3n de objetos, como Faster R-CNN o YOLO, se basan en una serie de heur\u00edsticas y componentes especializados, como anclajes de cajas (<em>anchor boxes<\/em>) y supresi\u00f3n no m\u00e1xima (NMS). 
<h2><strong>DETR: Object Detection with Transformers</strong></h2>

<p>The influence of Transformers did not stop at classification and segmentation. The paper "<a href="https://research.facebook.com/publications/end-to-end-object-detection-with-transformers/" rel="noreferrer noopener" target="_blank"><em>End-to-End Object Detection with Transformers</em></a>" (DETR) from Facebook AI Research (now Meta AI) completely reimagined object detection.</p>

<p>Traditional object detectors, such as Faster R-CNN or YOLO, rely on a series of heuristics and specialized components, such as <em>anchor boxes</em> and non-maximum suppression (NMS). DETR eliminated all of these components.</p>

<p>DETR (DEtection TRansformer) redefines object detection by combining a CNN backbone with a standard Transformer and a fixed set of learned queries that predict boxes and classes.</p>

<h3><strong>DETR's innovation?</strong></h3>

<p>DETR framed object detection as direct set prediction. In a single pass, the Transformer outputs a fixed-size set of predictions, each containing an object's bounding-box coordinates and its class. To avoid duplicate predictions, it uses bipartite matching to assign the model's predictions to the ground-truth labels in the optimal way, as sketched below.</p>

<p>DETR drastically simplified the object detection <em>pipeline</em>, making it more elegant and, more importantly, removing hand-crafted components that often introduced hard-to-tune hyperparameters.</p>
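<p>That bipartite matching can be illustrated with SciPy's Hungarian-algorithm solver, which is what DETR's matcher relies on under the hood. The cost matrix below is a toy stand-in for DETR's real cost, which mixes a classification term with L1 and generalized-IoU box distances:</p>

<pre><code>import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = cost of assigning prediction i to ground-truth object j
cost = np.array([
    [0.2, 0.9, 0.8],
    [0.7, 0.1, 0.9],
    [0.8, 0.8, 0.3],
    [0.9, 0.7, 0.6],  # 4 predictions, 3 ground-truth objects
])

pred_idx, gt_idx = linear_sum_assignment(cost)  # minimal-cost one-to-one assignment
for p, g in zip(pred_idx, gt_idx):
    print(f"prediction {p} -> ground truth {g} (cost {cost[p, g]})")
# unmatched predictions are trained to output the "no object" class</code></pre>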
<h3><strong>Architecture and operation</strong></h3>

<ul>
<li>A CNN backbone extracts spatial features.</li>
<li>The Transformer encodes global relationships between patches.</li>
<li>Learned queries predict objects with no need for anchors or manual NMS.</li>
</ul>

<h4><strong><em>Example: Detection with DETR in PyTorch</em></strong></h4>

<pre><code>import torch
import matplotlib.pyplot as plt
from PIL import Image
import torchvision.transforms as T

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Load a DETR model pretrained on COCO from torch.hub
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.to(device)
model.eval()

# 2. Load the image
image = Image.open("/content/auto.jpg")

# 3. Prepare the transforms
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

img_tensor = transform(image).unsqueeze(0).to(device)

# 4. Run the model
with torch.no_grad():
    outputs = model(img_tensor)

# 5. Apply softmax to the logits
probas = outputs['pred_logits'].softmax(-1)[0, :, :-1]  # drop the "no object" class
boxes = outputs['pred_boxes'][0]

# 6. Keep predictions with probability > 0.9
keep = probas.max(-1).values > 0.9
filtered_boxes = boxes[keep]
filtered_scores = probas[keep]
filtered_labels = filtered_scores.argmax(-1)

# 7. Rescale boxes to the original image (cx, cy, w, h -> x0, y0, x1, y1)
def rescale_bboxes(bboxes, size):
    img_w, img_h = size
    b = bboxes.clone().cpu()  # make sure everything is on the CPU
    scale = torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)
    b = b * scale
    b[:, :2] -= b[:, 2:] / 2
    b[:, 2:] += b[:, :2]
    return b

labels = filtered_labels.cpu().numpy()
scores = filtered_scores.max(-1).values.cpu().numpy()
bboxes = rescale_bboxes(filtered_boxes.cpu(), image.size).numpy()

# 8. Draw on the original image
plt.figure(figsize=(10, 10))
plt.imshow(image)
ax = plt.gca()

CLASSES = [
    'N/A', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant',
    'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A', 'handbag',
    'tie', 'suitcase', 'frisbee',
class=\"hljs-string\">'skis'<\/span>, <span class=\"hljs-string\">'snowboard'<\/span>, <span class=\"hljs-string\">'sports ball'<\/span>, <span class=\"hljs-string\">'kite'<\/span>,\n  <span class=\"hljs-string\">'baseball bat'<\/span>, <span class=\"hljs-string\">'baseball glove'<\/span>, <span class=\"hljs-string\">'skateboard'<\/span>, <span class=\"hljs-string\">'surfboard'<\/span>, <span class=\"hljs-string\">'tennis racket'<\/span>,\n  <span class=\"hljs-string\">'bottle'<\/span>, <span class=\"hljs-string\">'N\/A'<\/span>, <span class=\"hljs-string\">'wine glass'<\/span>, <span class=\"hljs-string\">'cup'<\/span>, <span class=\"hljs-string\">'fork'<\/span>, <span class=\"hljs-string\">'knife'<\/span>, <span class=\"hljs-string\">'spoon'<\/span>, <span class=\"hljs-string\">'bowl'<\/span>,\n  <span class=\"hljs-string\">'banana'<\/span>, <span class=\"hljs-string\">'apple'<\/span>, <span class=\"hljs-string\">'sandwich'<\/span>, <span class=\"hljs-string\">'orange'<\/span>, <span class=\"hljs-string\">'broccoli'<\/span>, <span class=\"hljs-string\">'carrot'<\/span>, <span class=\"hljs-string\">'hot dog'<\/span>, <span class=\"hljs-string\">'pizza'<\/span>,\n  <span class=\"hljs-string\">'donut'<\/span>, <span class=\"hljs-string\">'cake'<\/span>, <span class=\"hljs-string\">'chair'<\/span>, <span class=\"hljs-string\">'couch'<\/span>, <span class=\"hljs-string\">'potted plant'<\/span>, <span class=\"hljs-string\">'bed'<\/span>, <span class=\"hljs-string\">'N\/A'<\/span>, <span class=\"hljs-string\">'dining table'<\/span>,\n  <span class=\"hljs-string\">'N\/A'<\/span>, <span class=\"hljs-string\">'N\/A'<\/span>, <span class=\"hljs-string\">'toilet'<\/span>, <span class=\"hljs-string\">'N\/A'<\/span>, <span class=\"hljs-string\">'tv'<\/span>, <span class=\"hljs-string\">'laptop'<\/span>, <span class=\"hljs-string\">'mouse'<\/span>, <span class=\"hljs-string\">'remote'<\/span>, <span class=\"hljs-string\">'keyboard'<\/span>,\n  <span class=\"hljs-string\">'cell phone'<\/span>, <span class=\"hljs-string\">'microwave'<\/span>, <span class=\"hljs-string\">'oven'<\/span>, <span class=\"hljs-string\">'toaster'<\/span>, <span class=\"hljs-string\">'sink'<\/span>, <span class=\"hljs-string\">'refrigerator'<\/span>, <span class=\"hljs-string\">'N\/A'<\/span>,\n  <span class=\"hljs-string\">'book'<\/span>, <span class=\"hljs-string\">'clock'<\/span>, <span class=\"hljs-string\">'vase'<\/span>, <span class=\"hljs-string\">'scissors'<\/span>, <span class=\"hljs-string\">'teddy bear'<\/span>, <span class=\"hljs-string\">'hair drier'<\/span>, <span class=\"hljs-string\">'toothbrush'<\/span>\n]\n\n<span class=\"hljs-keyword\">for<\/span> (xmin, ymin, xmax, ymax), label, score in zip(bboxes, labels, scores):\n    ax.add_patch(plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,\n                               fill=<span class=\"hljs-keyword\">False<\/span>, color=<span class=\"hljs-string\">'red'<\/span>, linewidth=<span class=\"hljs-number\">2<\/span>))\n    ax.text(xmin, ymin, f<span class=\"hljs-string\">'{CLASSES&#91;label]}: {score:0.2f}'<\/span>,\n            fontsize=<span class=\"hljs-number\">12<\/span>, color=<span class=\"hljs-string\">'white'<\/span>, bbox=dict(facecolor=<span class=\"hljs-string\">'red'<\/span>, alpha=<span class=\"hljs-number\">0.5<\/span>))\nplt.axis(<span class=\"hljs-string\">'off'<\/span>)\nplt.show()<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-4\"><span class=\"shcb-language__label\">Code language:<\/span> <span 
class=\"shcb-language__name\">PHP<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">php<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p><strong>El modelo DETR preentrenado: <\/strong>Usa <code>torch.hub<\/code> para cargar un modelo DETR ya entrenado en COCO. Lee una imagen (<code>example.jpg<\/code> u otra) desde internet o desde el disco local. Redimensiona, normaliza y convierte la imagen en tensor para que el modelo pueda entenderla. El modelo devuelve una lista de objetos detectados con sus clases y cajas delimitadoras (bounding boxes). Dibuja las cajas con etiquetas sobre la imagen original para mostrar qu\u00e9 se ha detectado.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\" id=\"h-comparativa-cnn-vs-transformers-en-nbsp-vision\"><strong>Comparativa: CNN vs. Transformers en&nbsp;visi\u00f3n<\/strong><\/h2>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*S4HQujZBjL07z36vtT4Ebg.jpeg\" alt=\"\"\/><\/figure><\/div>\n\n\n<h2 class=\"wp-block-heading has-text-align-center\" id=\"h-el-futuro-mas-alla-de-los-nbsp-cnns\"><strong>El Futuro: M\u00e1s All\u00e1 de los&nbsp;CNNs<\/strong><\/h2>\n\n\n\n<p>El auge de ViT, SAM y DETR no significa el fin de las CNNs, que siguen siendo extremadamente eficientes y potentes en muchas aplicaciones. Sin embargo, marca un cambio de paradigma. Los Transformers han demostrado ser excepcionalmente buenos en la comprensi\u00f3n del contexto global, la escalabilidad y la capacidad de generalizaci\u00f3n.<\/p>\n\n\n\n<p>Hoy en d\u00eda, las arquitecturas h\u00edbridas que combinan las fortalezas de las CNNs (eficiencia y capacidad de extraer caracter\u00edsticas locales) y los Transformers (comprensi\u00f3n contextual) est\u00e1n ganando terreno. El futuro de la visi\u00f3n por computador se dirige hacia modelos m\u00e1s unificados, capaces de realizar m\u00faltiples tareas (clasificaci\u00f3n, detecci\u00f3n, segmentaci\u00f3n) con una sola arquitectura.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><a class=\"alt=&quot;Funcionamiento interno del modelo SAM para segmentaci\u00f3n&quot;\" href=\"https:\/\/cdn.you.com\/youagent-images\/gpt-image-1\/a2772472-39c2-4eb9-9a77-a948b4903f63.png\" target=\"_blank\" rel=\" noreferrer noopener\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*t1iSdNjKbXHzwVzKk_LaSA.png\" alt=\"\"\/><\/a><\/figure><\/div>","protected":false},"excerpt":{"rendered":"<p>La visi\u00f3n artificial ha avanzado de manera exponencial con el advenimiento de los Vision Transformers (ViT) y el Segment Anything Model (SAM) de Meta. Estos modelos no solo est\u00e1n redefiniendo c\u00f3mo entendemos la segmentaci\u00f3n y la clasificaci\u00f3n de im\u00e1genes, sino que tambi\u00e9n est\u00e1n marcando un cambio crucial respecto a las tradicionales redes convolucionales (CNN). 