#computervision

waynerad@diasp.org

CES 2024: "Looking into the future".

"They are putting AI in everything."

AI pillow, AI mattress, AI office chair, AI fridges, AI washers & dryers, AI smart lamps, AI grills, AI barbecue, AI cooking gear, AI pressure cookers, AI food processors, AI air fryers, AI stethoscopes, AI bird feeders, AI telescopes, AI backpacks, AI upscaling TVs, AI realtime language translator. Most require an internet connection to work.

Manufacturing and warehouse robots, delivery robots, lawn care robots, lawnmower robots, pool cleaning robots, robot bartender, robot barista, robots that cook stir fry and make you ice cream, robot front desk assistant, hospital robot, robot to throw tennis balls for your dog and feed your dog, robot that can roll around your house and project things on the wall.

Computer vision self-checkout without scanning barcodes, computer vision food tray scanner that tells you how many calories are in the food, whether it has any allergens, and other stuff to do with the food, computer vision for vehicles.

Augmented reality form factor that is just regular glasses, 3D video without glasses, VR roller coaster haptic suits.

A car with all 4 wheels able to move independently so it can rotate in place and move sideways for parallel parking.

A 1-person autonomous helicopter with no steering mechanism, and a car with drone propellers on the roof that can fold into the car.

Giant LED walls everywhere, a ride where people sit on a seat hanging from the ceiling while moving through a world that's all on a giant LED screen.

Transparent TVs -- possibly great for storefronts.

A water maker pulling moisture from the air, home beer making, an automated manicure system, a mouth mask that enables your friends to hear you in a video game but nobody can hear you in real life.

Crazy AI tech everywhere (the CES 2024 experience) - Matt Wolfe

#solidstatelife #ai #genai #robotics #virtualreality #augmentedreality #computervision #ces2024

waynerad@diasp.org

10 noteworthy AI research papers of 2023 that Sebastian Raschka read. I didn't read them, so I'm just going to pass along his blog post with his analyses.

1) Pythia: Insights from Large-Scale Training Runs -- noteworthy because it does indeed present insights from large-scale training runs, such as, "Does pretraining on duplicated data make a difference?"

2) Llama 2: Open Foundation and Fine-Tuned Chat Models -- noteworthy because it introduces the popular LLaMA 2 family of models and explains them in depth.

3) QLoRA: Efficient Finetuning of Quantized LLMs -- this is a fine-tuning technique that is less resource intensive. LoRA stands for "low-rank adaptation" and the "Q" stands for "quantized". "Low rank" is just a fancy way of saying the added matrices have few dimensions. The "Q" part reduces resources further by "quantizing" the matrices, which means to use fewer bits and therefore lower precision for all the numbers. (There's a little code sketch of the low-rank idea after this list.)

4) BloombergGPT: A Large Language Model for Finance -- noteworthy because, well, not just because it's a relatively large LLM pretrained on a domain-specific dataset, but because, he says, it "made me think of all the different ways we can pretrain and finetune models on domain-specific data, as summarized in the figure below" (which is actually not in the paper).

5) Direct Preference Optimization: Your Language Model is Secretly a Reward Model -- noteworthy because it takes on, head-on, the challenge of replacing the Reinforcement Learning from Human Feedback (RLHF) technique. "While RLHF is popular and effective, as we've seen with ChatGPT and Llama 2, it's also pretty complex to implement and finicky." The replacement technique is called Direct Preference Optimization (DPO). This has actually been on my reading list for weeks and I haven't gotten around to it. Maybe I will one of these days, and then you can compare my take with his, which you can read now.

6) Mistral 7B -- noteworthy, despite its brevity, because it's the base model used for the first DPO model, Zephyr 7B, which has outperformed similar sized models and set the stage for DPO to replace RLHF. It's additionally noteworthy for its "sliding window" attention system and "mixture of experts" technique. (There's a sketch of the sliding-window idea after this list.)

7) Orca 2: Teaching Small Language Models How to Reason -- noteworthy because it has the famous "distillation" technique where a large language model such as GPT-4 is used to create training data for a small model.

8) ConvNets Match Vision Transformers at Scale -- noteworthy because if you thought vision transformers relegated the old-fashioned vision technique, convolutional neural networks, to the dustbin, think again.

9) Segment Anything -- noteworthy because of the creation of the world's largest segmentation dataset to date with over 1 billion masks on 11 million images. And because in only 6 months, it has been cited 1,500 times and become part of self-driving car and medical imaging projects.

10) Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models -- this is about Emu, the text-to-video system. I'm going to be telling you all about Google's system as soon as I find the time, and I don't know if I'll ever have time to circle back and check out Emu, but I encourage you all to check it out for yourselves.
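
Since item 3 mentions the low-rank idea, here's a minimal sketch of it, assuming a PyTorch-style linear layer. This is my own illustration, not code from the QLoRA paper; the real thing also stores the frozen weights in 4-bit precision, which I'm skipping here.

```python
# A minimal sketch of the LoRA idea (not the QLoRA authors' code):
# freeze a pretrained weight matrix W and learn only a low-rank update B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Pretrained weight: frozen, never updated during fine-tuning.
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Low-rank adapters: these are the only trainable parameters.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen path plus the low-rank correction.
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(768, 768, rank=8)
x = torch.randn(4, 768)
print(layer(x).shape)  # torch.Size([4, 768])
```

The point is that only lora_A and lora_B receive gradients, so the number of trainable parameters is tiny compared to the frozen weight matrix.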
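
And since item 6 mentions sliding-window attention, here's a rough sketch of what that mask looks like -- again my own illustration, not Mistral's code: each token attends only to itself and a fixed-size window of previous tokens instead of the entire history.

```python
# A rough sketch of a sliding-window attention mask: True means
# "this query position may attend to this key position".
import torch

def sliding_window_mask(seq_len, window_size):
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i                          # no attending to future tokens
    windowed = (i - j) < window_size         # no attending too far back
    return causal & windowed

print(sliding_window_mask(6, 3).int())
```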

Ten noteworthy AI research papers of 2023

#solidstatelife #ai #genai #llms #convnets #rlhf #computervision #segmentation

waynerad@diasp.org

Large Vision Models. Andrew Ng and Dan Maloney say internet images are not good for training large vision models for industry, and Landing.AI is developing "domain-specific" large vision models, where "domain-specific" would mean, for example, specific for circuit boards, aerial imagery, or medical images.

The LVM (large vision model) revolution is coming a little after the LLM (large language model) one

#solidstatelife #ai #computervision

waynerad@diasp.org

Gaussian splatting: A new technique for rendering 3D scenes -- a successor to neural radiance fields (NeRF). In traditional computer graphics, scenes are represented as meshes of polygons. The polygons have a surface that reflects light, and the GPU will calculate what angle the light hits the polygons and how the polygon surface affects the reflected light -- color, diffusion, transparency, etc.

In the world of neural radiance fields (NeRF), a neural network, trained on a set of photos from a scene, will be asked: from this point in space, with the light ray going in this direction, how should the pixel rendered to the screen be colored? In other words, you are challenging a neural network to learn ray tracing. The neural network is challenged to learn all the details of the scene and store those in its parameters. It's a crazy idea, but it works. But it also has limitations -- you can't view the scene from an angle very far from the original photos, it doesn't work for scenes that are too large, and so on. At no point does it ever translate the scene into traditional computer graphics meshes, so the scene can't be used in a video game or anything like that.

The new technique goes by the unglamorous name "Gaussian splatting". This time, instead of asking for a neural network to tell you the result of a ray trace, you're asking it, initially, to render the scene as a "point cloud" -- that is to say, simply a collection of points, not even polygons. This is just the first step. Once you have the initial point cloud, then you switch to Gaussians.

The concept of a 3D "Gaussian" may take a minute to wrap your brain around. We're all familiar with the "bell curve", also called the normal distribution, also called the Gaussian distribution. That's a function of 1 dimension, that is to say, G = f(x). To make the 3D Gaussian, you extend it to all 3 dimensions, so G = f(x, y, z).

Not only that, but the paper makes a big deal of the fact that their 3D Gaussians are "anisotropic". What this means is that they are not nice, spherical Gaussians; rather, they are stretched and have a direction. When rendering 3D graphics, many materials are directional, such as wood grain, brushed metal, fabric, and hair. Anisotropic Gaussians also help in scenes where the light source is not spherical, the viewing angle is very oblique, the texture is viewed from a sharp angle, or the geometry has sharp edges.
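
For reference (this is the textbook definition, not a formula quoted from the paper), a 3D Gaussian centered at mu with covariance matrix Sigma is:

```latex
G(\mathbf{x}) = \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
```

An isotropic (spherical) Gaussian is the special case where Sigma is a constant times the identity matrix; letting Sigma be a general covariance matrix is what allows each Gaussian to stretch and point in a particular direction.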

At this point you might be thinking: this all sounds a lot more complicated than simple polygons. What does using 3D Gaussians get us? The answer is that, unlike polygons, 3D Gaussians are differentiable. That magic word means you can calculate a gradient, which you might remember from your calculus class. Having a gradient means you can train it using stochastic gradient descent -- in other words, you can train it with standard deep learning training techniques. Now you've brought your 3D representation into the world of neural networks.

Even so, more cleverness is required to make the system work. The researchers made a special training system that enables geometry to be created, deleted, or moved within a scene, because, inevitably, geometry gets incorrectly placed due to the ambiguities of the initial 3D to 2D projection. After every 100 iterations, Gaussians that are "essentially transparent" are removed, while new ones are added to "densify" and fill gaps in the scene. To do this, they made an algorithm to detect regions that are not yet well reconstructed.

With these in place, the training system generates 2D images and compares them to the training views provided, and iterates until it can render them well.
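
To make the "render, compare, iterate" loop concrete, here's a toy sketch of the idea in PyTorch -- my own heavily simplified 2D version, nothing like the paper's CUDA renderer: a handful of Gaussian "splats" are rendered differentiably into a tiny image, compared to a target image, and their parameters are updated by gradient descent. The real system does this in 3D with anisotropic Gaussians, camera projection, proper alpha blending, and the periodic densification and pruning described above.

```python
# Toy differentiable "splatting": fit isotropic 2D Gaussians to a target image.
import torch

H = W = 32
N = 50  # number of splats
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
pix = torch.stack([xs, ys], dim=-1)                    # (H, W, 2) pixel coordinates

pos     = torch.rand(N, 2, requires_grad=True)         # splat centers
log_sig = torch.full((N,), -3.0, requires_grad=True)   # splat sizes (log std dev)
color   = torch.rand(N, 3, requires_grad=True)         # RGB per splat
opacity = torch.zeros(N, requires_grad=True)           # pre-sigmoid opacity

target = torch.rand(H, W, 3)                           # stand-in for a training photo
opt = torch.optim.Adam([pos, log_sig, color, opacity], lr=0.05)

for step in range(200):
    d2 = ((pix[:, :, None, :] - pos[None, None, :, :]) ** 2).sum(-1)    # (H, W, N)
    weight = torch.sigmoid(opacity) * torch.exp(-0.5 * d2 / torch.exp(log_sig) ** 2)
    image = (weight[..., None] * color[None, None]).sum(2).clamp(0, 1)  # additive "splatting"
    loss = ((image - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())
```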

To do the rendering, they developed their own fast renderer for Gaussians, which is where the word "splats" in the name comes from. When the Gaussians are rendered, they are called "splats". The reason they took the trouble to create their own rendering system was -- you guessed it -- to make the entire rendering pipeline differentiable.

3D Gaussian Splatting for real-time radiance field rendering - Inria GraphDeco research group

#solidstatelife #ai #genai #computervision #computergraphics

waynerad@diasp.org

AI headlines: ChatGPT Vision, ChatGPT Browsing, OpenAI Dall E 3, Jony Ive Joining OpenAI?, Tesla Optimus Robot, Meta AI, Meta EMU, Meta AI Glasses, Meta Quest 3, Mistral 7B, Amazon X Anthropic, Amazon Alexa AI, Leonardo Elements, Microsoft Copilot Windows 11, SpaceX Gov't Contract, CIA LLM, YouTube AI, Google Gemini Launch.

OpenAI Dall-E 3 is "Now on par with the newest version of Midjourney" (is it just me or do these comparisons seem to not take into account that different art generation models have different styles?).

ChatGPT Vision gives ChatGPT audio and image input that enables it to do voice dialog and read images.

ChatGPT Browsing means you can do web browsing inside ChatGPT and ChatGPT can incorporate information from the web browsing.

Jony Ive, the key designer behind the iPod and iPhone, is reportedly in talks with OpenAI and might join OpenAI. Rumor is OpenAI might make some kind of "consumer device".

Tesla's Optimus humanoid robot has improved a lot and he (the YouTuber, Matthew Berman) speculates it is improving so rapidly it may overtake Boston Dynamics robots in capability.

"Meta AI" is a new AI is supposed to be an advanced conversational assistant from the company formerly known as Facebook.

"Meta EMU" is the company formerly known as Facebook's AI generative art product. They plan to build it in to Instagram.

"Meta AI Glasses" are basically Google Glass but look like ordinary sunglasses and are made through a partnership between the company formerly known as Facebook and Ray-Ban. Except unlike Google Glass, you'll use AI to tell it to take photos, play music, and make phone calls.

"Meta Quest 3" is the company formerly known as Facebook's new VR headset, although they've dropped the term "VR" and I'm not sure this announcement has anything to do with AI.

Mistral 7B is the first AI model launched by Mistral (whoever they are). Apparently it's totally open source and competes well with the company formerly known as Facebook's Code LLaMa models.

Amazon X Anthropic is Anthropic's Claude model provided through Amazon. Apparently Amazon has decided to cozy up to Anthropic like Microsoft cozied up to OpenAI, with Amazon announcing a long-term collaborative deal coupled with a $4 billion investment in Anthropic.

Amazon Alexa AI is reported to be bringing generative AI to Amazon's Alexa product. The new AI is said to be more conversational and will respond to your tone of voice, body language, gestures, and eye contact.

Leonardo Elements is a generative AI art system that allows you to add additional effects to your generated images.

Microsoft Windows 11 With Copilot is a system that allows you not only to chat with your Windows operating system and get help, but also to have it actually control the system and do things for you.

SpaceX will provide a networking service to the military called Starshield, which will use Starlink satellites.

The CIA is building its own large language model. Apparently it will be trained on publicly available Chinese social media data.

YouTube AI is a bunch of AI features for content creators. Dream Screen allows you to visually transport yourself anywhere just by typing a prompt.

Personalized AI Insights gives you video ideas and outlines tailored to you based on your channel activity.

Auto-Dubbing with Aloud enables you to localize your video into other languages.

Assistive Music Search finds music for your video.

Google Gemini is planned for launch soon, and will be a multi-modal model from launch.

Biggest week in AI news in months! (ep10) - Matthew Berman

#solidstatelife #ai #genai #computervision #llms

waynerad@diasp.org

"Funky AI-generated spiraling medieval village captivates social media."

Reminds me of M.C. Escher.

And check out the slideshow in the middle of the checkerboard medieval village.

The images are made with ControlNet. ControlNet allows you to, in addition to the text prompt, provide an additional image which will "condition" the generated image. If the condition is a simple black-and-white pattern, like a spiral or checkerboard, it turns out that pattern gets reproduced in the output.

I can't really explain for you exactly how ControlNet works. Looking through the paper, I can tell you that what it does is take a diffusion network and "lock down" the parameters. Then, on a block-by-block basis, it copies the blocks and makes the parameters on the copy changeable again. Diffusion networks are made of ResNet (residual network) "blocks", where a "block" is a group of layers of the same dimensions that function together.

While the original blocks are connected together in a sequence, the copied blocks are treated as "nested", with the output from the first blocks going back into the original network near the end, and the output from later blocks feeding back into the original network near the center. When the output is fed back in, it first goes through a convolution the inventors call a "zero convolution". This doesn't mean the area used for convolution has zero size, like you might think. No, the "zero" refers to the idea that the convolution weights are initialized to all zeros, and grow from there to some optimized values over the course of training.
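
Here's a rough sketch of how I understand the "zero convolution" idea -- my own illustration, not the authors' code, and the way the condition image is injected here is simplified: the trainable copy's output is folded back into the frozen network through a 1x1 convolution whose weights start at zero, so at the start of training the original network's behavior is unchanged.

```python
# Sketch of a zero-initialized 1x1 convolution and a frozen-block-plus-trainable-copy
# arrangement. This is an illustration of the idea, not ControlNet's actual architecture.
import torch
import torch.nn as nn

def zero_conv(channels):
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)  # starts as the zero function...
    nn.init.zeros_(conv.bias)    # ...and grows during training
    return conv

class ControlledBlock(nn.Module):
    def __init__(self, frozen_block, trainable_copy, channels):
        super().__init__()
        self.frozen_block = frozen_block      # original block, parameters locked
        self.trainable_copy = trainable_copy  # copy of the block, parameters trainable
        self.zero = zero_conv(channels)
        for p in self.frozen_block.parameters():
            p.requires_grad = False

    def forward(self, x, condition):
        # The copy sees the conditioning signal; its output is folded back in
        # through the zero convolution, which slowly "turns on" during training.
        return self.frozen_block(x) + self.zero(self.trainable_copy(x + condition))

block = ControlledBlock(nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 3, padding=1), 8)
x = torch.randn(1, 8, 16, 16)
cond = torch.randn(1, 8, 16, 16)
print(block(x, cond).shape)  # torch.Size([1, 8, 16, 16])
```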

Anyway, somehow this creates a tension between the original image and the additional "condition" image such that the locked-down network pushes to preserve the style and structure of the original image while the trainable copy pushes to match the pattern in the condition image, and the end result is a combination of both.

Funky AI-generated spiraling medieval village captivates social media

#solidstatelife #ai #computervision #genai #diffusionnetworks #controlnet

waynerad@diasp.org

An OCR system has been developed that can convert PDFs of scientific papers dense with mathematical equations into machine-readable markup text. For mathematical equations, it outputs the LaTeX format.

"Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost. Existing Optical Character Recognition (OCR) engines, such as Tesseract OCR, excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial. Converting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset, capture the text of 12M2 papers using GROBID, but are missing meaningful representations of the mathematical equations. To this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text."

The researchers have released a pre-trained model capable of converting a PDF to a lightweight markup language.

"Our method is only dependent on the image of a page, allowing access to scanned papers and books."

"To the best of our knowledge there is no paired dataset of PDF pages and corresponding source code out there, so we created our own from the open access articles on arXiv. For layout diversity we also include a subset of the PubMed Central (PMC) open access non-commercial dataset. During the pretraining, a portion of the Industry Documents Library (IDL) is included."

The model they came up with to do this is called Nougat, "an end-to-end trainable encoder-decoder transformer based model for converting document pages to markup." It's basically a vision transformer model.

A lot of the paper is concerned with technicalities such as splitting pages, ignoring headers and footers with page numbers, and handling the various compression and distortion artifacts, blur, and noise that can exist in the image to be OCRed.

To measure the performance of the model, they calculated edit distance, BLEU score, METEOR score, and F1-score.

"The edit distance, or Levenshtein distance, measures the number of character manipulations (insertions, deletions, substitutions) it takes to get from one string to another. In this work we consider the normalized edit distance, where we divide by the total number of characters."

"The BLEU metric was originally introduced for measuring the quality of text that has been machinetranslated from one language to another. The metric computes a score based on the number of matching n-grams between the candidate and reference sentence.

METEOR is "another machine-translating metric with a focus on recall instead of precision."

The F1-score incorporates both precision and recall; the paper says, "We also compute the F1-score and report the precision and recall."
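
For reference, the standard definition (not quoted from the paper) is:

```latex
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```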

They compared with a previous OCR system, GROBID with LaTeX OCR, on math equations:

Edit distance (smaller is better): GROBID + LaTeX OCR 0.727, Nougat Small (250 million parameters) 0.117, Nougat Base (350 million parameters) 0.128.

BLEU (larger is better): GROBID + LaTeX OCR 0.3, Nougat Small 56.0, Nougat Base 56.9.

METEOR (larger is better): GROBID + LaTeX OCR 5.0, Nougat Small 74.7, Nougat Base 75.4.

F1 (larger is better): GROBID + LaTeX OCR 9.7, Nougat Small 76.9, Nougat Base 76.5.

This sounds like something that could be incredibly useful.

Nougat: Neural optical understanding for academic documents

#solidstatelife #ai #computervision #ocr #latex

waynerad@diasp.org

Bots have surpassed humans at solving CAPTCHAs.

CAPTCHA Type: reCAPTCHA (click)
Human Time: 3.1-4.9 seconds
Human Accuracy: 71-85%
Bot Time: 1.4 seconds
Bot Accuracy: 100%

CAPTCHA Type: Geetest
Human Time: 28-30 seconds
Human Accuracy: N/A
Bot Time: 5.3 seconds
Bot Accuracy: 96%

CAPTCHA Type: Arkose
Human Time: 18-42 seconds
Human Accuracy: N/A
Bot Time: N/A
Bot Accuracy: N/A

CAPTCHA Type: Distorted Text
Human Time: 9-15.3 seconds
Human Accuracy: 50-84%
Bot Time: <1 second
Bot Accuracy: 99.8%

CAPTCHA Type: reCAPTCHA (image)
Human Time: 15-26 seconds
Human Accuracy: 81%
Bot Time: 17.5 seconds
Bot Accuracy: 85%

CAPTCHA Type: hCAPTCHA
Human Time: 18-32 seconds
Human Accuracy: 71-81%
Bot Time: 14.9 seconds
Bot Accuracy: 98%

Quoting from the paper about how the human scores were calculated.

"To understand the landscape of modern CAPTCHAS and guide the design of the subsequent user study, we manually inspected the 200 most popular websites from the Alexa Top Website list."

"Our goal was to imitate a normal user's web experience and trigger CAPTCHAS in a natural setting. Although CAPTCHAS can be used to protect any section or action on a website,
they are often encountered during user account creation to prevent bots creating accounts."

"The most prevalent types were: reCAPTCHA was the most prevalent, appearing on 68 websites (34% of the inspected websites)."

"Slider-based CAPTCHAS appeared on 14 websites (7%)."

"Distorted Text CAPTCHAS appeared on 14 websites (7%)."

"Game-based CAPTCHAS appeared on 9 websites (4.5%)."

"hCAPTCHA appeared on 1 website."

"Other CAPTCHAs found during our inspection included: a CAPTCHA resembling a scratch-off lottery ticket; a CAPTCHA asking users to locate Chinese characters within an image;
and a proprietary CAPTCHA service called 'NuCaptcha'."

"Having identified the relevant CAPTCHA types, we conducted a 1,000 participant online user study to evaluate real users' solving times and preferences for these types of CAPTCHAS. Our study was run using using Amazon MTurk."

They have a table where they list out the ages, countries, education, gender, device type, input method, and internet use. They seem to have a sufficiently broad spectrum of humans.

Ages: 30-39: 531, 20-29: 403, 40-49: 271, 50-59: 106, 60+: 58, 18-19: 31.
Countries: USA: 985, India: 240, Brazil: 50, Italy: 27, UK: 24, Other: 74.
Education: Bachelors: 822, Masters: 243, high school: 210, Associates: 98, PhD: 24, no degree: 3.
Gender: male: 832, female: 557, nonbinary: 11.
Device type: computer: 1301, phone: 74, tablet: 25.
Input method: keyboard: 1261, touch: 125, other: 14.
Internet use: work: 860, web surf: 397, education: 87, gaming: 30, other: 26.

"Direct setting: This setting was designed to match previous CAPTCHA user studies, in which participants are directly asked to solve CAPTCHAS."

"Contextualized setting: This setting was designed to measure CAPTCHA solving behavior in the context of a typical web activity."

"For reCAPTCHA, the selection between image- or click based tasks is made dynamically by Google. Whilst we know that 85% and 71% of participants (easy and hard setting) were shown a click-based CAPTCHA, the exact task-to-participant mapping is not revealed to website operators. We therefore assume that the slowest solving times correspond to imagebased tasks. After disambiguation, click-based reCAPTCHA had the lowest median solving time at 3.7 seconds. Curiously, there was little difference between easy and difficult settings."

"The next lowest median solving times were for distorted text CAPTCHAS. As expected, simple distorted text CAPTCHAS were solved the fastest. Masked and moving versions had very similar solving times. For hCAPTCHA, there is a clear distinction between easy and difficult settings."

"The latter consistently served either a harder image-based task or increased the number of rounds. However, for both hCAPTCHA settings, the fastest solving times are similar to those of reCAPTCHA and distorted text. Finally, the gamebased and slider-based CAPTCHAS generally yielded higher median solving times, though some participants still solved
these relatively quickly (e.g., < 10 seconds)."

"With the exception of reCAPTCHA (click) and distorted text, we observed that solving times for other types have a relatively high variance."

"reCAPTCHA: The accuracy of image classification was 81% and 81.7% on the easy and hard settings respectively. Surprisingly, the difficulty appeared not to impact accuracy."

"hCAPTCHA: The accuracy was 81.4% and 70.6% on the easy and hard settings respectively. This shows that, unlike reCAPTCHA, the difficulty has a direct impact on accuracy."

"Distorted Text: We evaluated agreement among participants as a proxy for accuracy."

If you're wondering what AI systems they used to crack the CAPTCHAs... well... They didn't actually run any AI systems on the CAPTCHAs. They scoured "the literature" for AI performance scores. And they didn't provide a convenient table listing the sources for all their numbers on the AI performance. They have a section of references, of course, but there are 77 references. The paper focuses totally on the human testing and demographic breakdowns of it.

An Empirical Study & Evaluation of Modern CAPTCHAs

#solidstatelife #ai #computervision #captchas

waynerad@diasp.org

AI filmmaking is not the future, says Patrick (H) Willems. But what about those Wes Anderson trailers that people make in a matter of days that look shockingly like Wes Anderson's movies? Well, they actually copy the most obvious aesthetics of Wes Anderson while overlooking all the even slightly more subtle ones. Worst of all, they completely overlook Wes Anderson's humanity. They don't engage with his work on a deeper level. AI reduces Wes Anderson to "an Instagram filter". His movies do not consist of static shots with the same framing. There are incredibly kinetic shots, slow motion dolly shots, sparingly used chaotic handheld shots, stop motion work, action scenes, and, depending on the film, he uses different aspect ratios, film stocks, and color palettes.

None of the AI systems make stories for their hypothetical movies that are anything like what the real Wes Anderson would create. Wes Anderson movies are fun, but at the same time, they are all sad. In Asteroid City, a character is trying to figure out how to break the news to his children that their mother has died. The Darjeeling Limited is about three brothers processing the recent death of their father. The Life Aquatic is about a man seeking revenge for the death of his best friend, and along the way he meets his estranged son who (spoilers) also dies. Another recurring theme is finding meaning in a unique location or community. In Asteroid City, that happens when the characters are quarantined in a small desert town. In The Grand Budapest Hotel, a young refugee finds his purpose under the tutelage of an eccentric hotel concierge. In The French Dispatch, it's the collection of writers for the titular magazine. In The Life Aquatic, it's the boat. In The Royal Tenenbaums, a dysfunctional family reunites in their childhood home.

The AI parodies could pick up on these themes, but they don't. In the Star Wars parody, the AI systems don't have Luke Skywalker grieving the deaths of his aunt and uncle, then finding a new purpose on the Millennium Falcon with an eccentric bunch of rebels. The AI is incapable of going any deeper than, "Bill Murray is in a lot of Wes Anderson movies, and he's an old man. Gandalf is an old man. So here, nightmarishly, is Bill Murray's face pasted on Gandalf's head." Wes Anderson makes comedies, but none of the AI parodies are remotely funny.

How might Wes Anderson actually do X-Men? Willems talks about how a human would do it. Actually, he already did it himself, with a bunch of friends, so he describes how he did it. His friends made the costumes, shopped for props at thrift stores, built the sets, and played the characters.

AI filmmaking is not the future, it's a grift - Patrick (H) Willems

#solidstatelife #ai #computervision #genai #filmmaking #wesanderson

waynerad@diasp.org

GPT-4 easily solves CAPTCHA.

I got to try GPT-4's multimodal capabilities and it's quite impressive! A quick thread of examples... - Tanishq Mathew Abraham (@iScienceLuvr)

#solidstatelife #ai #nlp #llms #computervision #chatgpt #captcha