#computervision

waynerad@diasp.org

Creating sexually explicit deepfakes is to become a criminal offence in the UK. Under the new legislation, even if the images or videos were never intended to be shared, the person who created them will face a criminal record and an unlimited fine. If the images are shared, they face jail time.

Creating sexually explicit deepfakes to become a criminal offence

#solidstatelife #ai #genai #computervision #deepfakes #aiethics

waynerad@diasp.org

"The rise of generative AI and 'deepfakes' -- or videos and pictures that use a person's image in a false way -- has led to the wide proliferation of unauthorized clips that can damage celebrities' brands and businesses."

"Talent agency WME has inked a partnership with Loti, a Seattle-based firm that specializes in software used to flag unauthorized content posted on the internet that includes clients' likenesses. The company, which has 25 employees, then quickly sends requests to online platforms to have those infringing photos and videos removed."

This company Loti has a product called "Watchtower", which watches for your likeness online.

"Loti scans over 100M images and videos per day looking for abuse or breaches of your content or likeness."

"Loti provides DMCA takedowns when it finds content that's been shared without consent."

They also have a license management product called "Connect", and a "fake news protection" program called "Certify".

"Place an unobtrusive mark on your content to let your fans know it's really you."

"Let your fans verify your content by inspecting where it came from and who really sent it."

They don't say anything about how their technology works.

Hollywood celebs are scared of deepfakes. This talent agency will use AI to fight them.

#solidstatelife #ai #genai #computervision #deepfakes #aiethics

waynerad@diasp.org

Photorealistic AI-generated talking humans. "VLOGGER" is a system for generating video to match audio of a person talking. So you can make video of any arbitrary person saying any arbitrary thing. You just supply the audio (which could itself be AI-generated) and a still image of a person (which also could itself be AI-generated).

Most of the sample videos wouldn't play for me, but the ones in the top section did and seem pretty impressive. You have to "unmute" them to hear the audio and see that the video matches the audio.

They say the system works using a 2-step approach where the first step is to take just the audio signal, and use a neural network to predict what facial expressions, gaze, gestures, pose, body language, etc, would be appropriately associated with that audio, and the second step is to combine the output of the first step with the image you provide to generate the video. Perhaps surprisingly (at least to me), both of these are done with diffusion networks. I would've expected the second step to be done with diffusion networks, but the first to be done with some sort of autoencoder network. But no, they say they used a diffusion network for that step, too.

So the first step is taking the audio signal and converting it to spectrograms. In parallel, the input image is fed into a "reference pose" network that analyses it to determine what the person looks like and what pose the rest of the system has to deal with as a starting point.

These are fed into the "motion generation network". The output of this network is "residuals" that describe face and body positions. It generates one set of all these parameters for each frame that will be in the resulting video.

The result of the "motion generation network", along with the reference image and the pose of the person in the reference image is then passed to the next stage, which is the temporal diffusion network that generates the video. A "temporal diffusion" network is a diffusion network that generates images, but it has been modified so that it maintains consistency from frame to frame, hence the "temporal" word tacked on to the name. In this case, the temporal diffusion network has undergone the additional step of being trained to handle the 3D motion "residual" parameters. Unlike previous non-diffusion-based image generators that simply stretched images in accordance with motion parameters, this network incorporates the "warping" parameters into the training of the neural network itself, resulting in much more realistic renditions of human faces stretching and moving.

This neural network generates a fixed number of frames. They use a technique called "temporal outpainting" to extend the video to any number of frames. The "temporal outpainting" system feeds the previously generated frames, minus one, back in and uses them to generate the next frame. In this manner they can generate a video of any length, with any number of frames.

As a final step they incorporate an upscaler to increase the pixel resolution of the output.
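To make that two-stage flow concrete, here's a rough structural sketch in Python. Every class and function name is a placeholder of mine (the authors haven't released code), and the models are stubbed out, so treat this as a sketch of the described architecture, not the actual implementation.

```python
# Rough sketch of a VLOGGER-style two-stage pipeline.
# All names are placeholders; the models are stubs, so only the
# control flow (audio -> motion -> temporally consistent video) is real.

import numpy as np

def compute_spectrogram(audio, n_bins=80, hop=256):
    # Stand-in for a real mel-spectrogram computation.
    return np.zeros((max(1, len(audio) // hop), n_bins))

class AudioToMotionDiffusion:
    def sample(self, spectrogram, reference_image):
        # Step 1: one set of motion "residuals" (face/body parameters) per frame.
        return np.zeros((spectrogram.shape[0], 128))

class TemporalVideoDiffusion:
    def sample(self, reference_image, motion_chunk, previous_frames=None):
        # Step 2: render a chunk of frames conditioned on the reference image,
        # the predicted motion, and (optionally) previously generated frames.
        return [np.array(reference_image) for _ in motion_chunk]

def generate_talking_video(audio, reference_image, chunk_size=16):
    spectrogram = compute_spectrogram(audio)
    motion = AudioToMotionDiffusion().sample(spectrogram, reference_image)

    renderer = TemporalVideoDiffusion()
    frames, context = [], None
    for start in range(0, len(motion), chunk_size):
        chunk = motion[start:start + chunk_size]
        # "Temporal outpainting": feed previously generated frames (minus one)
        # back in so the next chunk stays consistent with what came before.
        new_frames = renderer.sample(reference_image, chunk, previous_frames=context)
        frames.extend(new_frames)
        context = new_frames[1:]
    return frames  # an upscaler would then raise the resolution of each frame

frames = generate_talking_video(np.zeros(16000), np.zeros((256, 256, 3)))
```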

VLOGGER: Multimodal diffusion for embodied avatar synthesis

#solidstatelife #ai #computervision #generativeai #diffusionmodels

waynerad@diasp.org

3D AI Studio claims to create 3D models using AI. They say it works from text prompts and image input. The 3D models can be exported in many formats, including OBJ, STL, FBX, USD, and more.

Commercial product with free tier.

They don't say anything about how the system works.

3D AI Studio

#solidstatelife #ai #computervision #genai

waynerad@diasp.org

The Scalable, Instructable, Multiworld Agent (SIMA) from DeepMind plays video games for you. You tell it what you want to do in regular language, and it goes into a 3D environment, including some provided by commercial video games, and carries out keyboard-and-mouse actions.

Before getting into how they did this, it might be worth citing some of the reasons they thought this was challenging: Video games can be open-ended, visually complex, and have hundreds of different objects. Video games are asynchronous -- no turn-taking as in chess or Go, or in many research environments, which stop and wait while the agent computes its next action. Each instance of a commercial video game needs its own GPU -- no running hundreds or thousands of actors per game per experiment as has been historically done in reinforcement learning. AI agents see the same screen pixels that a human player gets -- no access to internal game state, rewards, or any other "privileged information". AI agents use the same keyboard-and-mouse controls that humans do -- no handcrafted action spaces or high-level APIs.

In addition to all those challenges, they demanded their agents follow instructions in regular language, rather than simply pursuing a high score in the game, and the agents were not allowed to use simplified grammars or command sets.

"Since the agent-environment interface is human compatible, it allows agents the potential to achieve anything that a human could, and allows direct imitation learning from human behavior."

"A key motivation of SIMA is the idea that learning language and learning about environments are mutually reinforcing. A variety of studies have found that even when language is not necessary for solving a task, learning language can help agents to learn generalizable representations and abstractions, or to learn more efficiently." "Conversely, richly grounded learning can also support language learning."

I figure you're all eager to know what the games were. They were: Goat Simulator 3 (you play the goat), Hydroneer (you run a mining operation and dig for gold), No Man's Sky (you explore a galaxy of procedurally-generated planets), Satisfactory (you attempt to build a space elevator on an alien planet), Teardown (you complete heists by solving puzzles), Valheim (you try to survive in a world of Norse mythology), and Wobbly Life (you complete jobs to earn money to buy your own house).

However, before the games, they trained SIMA in research environments. Those, which you've probably never heard of, are: Construction Lab (agents are challenged to build things from construction blocks), Playhouse (a procedurally-generated house), ProcTHOR (procedurally-generated rooms, such as offices and libraries), and WorldLab (an environment with better simulated physics).

The SIMA agent itself maps visual observations and language instructions to keyboard-and-mouse actions. But it does that in several stages. For input, it takes a language instruction from you, and the pixels of the screen.

The video and language instruction both go through encoding layers before being input to a single, large, multi-modal transformer. The transformer doesn't output keyboard and mouse actions directly. Instead, it outputs a "state representation" that gets fed into a reinforcement learning network, which translates the "state" into what in reinforcement learning parlance is called a "policy". A more intuitive regular word might be "strategy". Basically this is a function that, when given input from the environment including the agent's state within the environment, will output an action. Here, the actions are the same actions a human would take with mouse and keyboard.
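Here's a hedged, toy-sized sketch of what that interface could look like in code. The layer sizes, the discretized action head, and every name are my own choices for illustration, not DeepMind's actual architecture:

```python
# Toy sketch of a SIMA-like agent: pixels + language instruction in,
# distribution over keyboard-and-mouse actions out.
import torch
import torch.nn as nn

class SimaLikeAgent(nn.Module):
    def __init__(self, d_model=512, num_actions=32):
        super().__init__()
        # Toy encoders standing in for the real (much larger) encoding layers.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, d_model))
        self.text_encoder = nn.Embedding(10000, d_model)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)
        # Policy head: maps the transformer's state representation to a
        # distribution over (discretized) keyboard-and-mouse actions.
        self.policy_head = nn.Linear(d_model, num_actions)

    def forward(self, screen_pixels, instruction_tokens):
        img = self.image_encoder(screen_pixels).unsqueeze(1)    # (B, 1, D)
        txt = self.text_encoder(instruction_tokens)             # (B, T, D)
        state = self.transformer(torch.cat([img, txt], dim=1))  # (B, 1+T, D)
        return torch.softmax(self.policy_head(state[:, 0]), dim=-1)

# Usage: one RGB frame plus a tokenized instruction -> action probabilities.
agent = SimaLikeAgent()
probs = agent(torch.rand(1, 3, 288, 512), torch.randint(0, 10000, (1, 12)))
```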

The multi-modal transformer was trained from scratch. They also used Classifier-Free Guidance (CFG), a technique borrowed from diffusion models, where it is used to "condition" the model on the text you, the user, typed in.
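For reference, classifier-free guidance at inference time typically blends a conditioned prediction with an unconditioned one -- something like this generic sketch (not SIMA's exact formulation):

```python
def classifier_free_guidance(logits_cond, logits_uncond, scale=1.5):
    # Amplify the part of the prediction that depends on the conditioning
    # signal (here, the language instruction) by extrapolating away from
    # the unconditioned prediction.
    return logits_uncond + scale * (logits_cond - logits_uncond)
```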

Even in the research environments, it is hard to automate judging of whether an agent completed its tasks. Instructions may be such things as, "make a pile of rocks to mark this spot" or "see if you can jump over this chasm". The environment may not provide any signal indicating these have been fulfilled. There are some they can handle, though, like "move forward", "lift the green cube", and "use the knife to chop the carrots".

For commercial video games, all the agent gets is pixels on the screen, just like a human player, and has no access to the internal game state of the game. The games generally don't allow any game state to be saved and restored, something researchers like for reproducibility.

For video games, they resorted to detecting on-screen text using OCR. They did this in particular for two games, No Man's Sky and Valheim, "which both feature a significant amount of on-screen text."

Why not just have people look, i.e. have humans judge whether the instructions were followed? Turns out humans were "the slowest and most expensive." They were able to get judgments from humans who were experts at the particular game an agent was playing, though.

For automated judgment, if a task contains a knife, a cutting board, and a carrot, the agent may ascertain the goal ("cut the carrot on the cutting board") without relying on the language instruction. This example illustrates the need to differentiate between following a language task and inferring the language task from "environmental affordances".

How'd SIMA do? It looks like its success rate got up to about 60% for Playhouse, but only about 30% for Valheim. That's the percentage of tasks completed. The ranking goes Playhouse, Worldlab, Satisfactory, Construction Lab, No Man's Sky, Goat Simulator 3, and Valheim.

"Note that humans would also find some of these tasks challenging, and thus human-level performance would not be 100%."

Grouped by "skill category", movement instructions ("stop", "move", "look") were the easiest, while food and resource gathering instructions ("eat", "cook", "collect", "harvest") were the hardest.

For No Man's Sky, they did a direct comparison with humans. Humans averaged 60%, while SIMA averaged around 30%.

How long til the AIs can beat the humans?

A generalist AI agent for 3D virtual environments

#solidstatelife #ai #genai #llms #computervision #multimodal #videogames

waynerad@diasp.org

"TripoSR: Fast 3D object reconstruction from a single image".

This is an impressive system where you can put in a single image, and it will generate a 3D model. They show videos going around the 3D model all 360-degrees.

I was, however, surprised and a bit disappointed to discover the output of this model is a neural radiance field, aka NeRF, not a traditional 3D model (using polygons) that you could plug into your existing video games, or even a model using the newer Gaussian splatting technique. A NeRF is a neural network where you put in light rays as input, and it outputs pixel values for what you should see along each light ray. It's like neural ray tracing.

First, a description of how the TripoSR system works. It builds on an earlier system called LRM, which stands for "large reconstruction model", and LRM in turn is built on DINO, plus a generative adversarial network (GAN) that makes NeRFs from the output of DINO. DINO (short for self-DIstillation with NO labels) is a vision transformer (ViT) that was made unusually large and trained in a "self-supervised" manner, analogous to how GPT is trained on language.

What the large reconstruction model (LRM) did is change DINO so it outputs an encoding known as a "3D triplane". That is, instead of outputting a 3D model in the form of 3D voxels, it outputs three 2D planes. I'd never heard of this technique. The idea is that three 2D planes give you a lot of the information you would need to construct the 3D object, or at least its visible surface, without storing the massive amount of data that true 3D voxels would require. The planes are oriented orthogonally to each other on combinations of the x, y, and z axes, intersecting at the origin (0, 0, 0). Think of them as the xy, xz, and yz planes. The way this works is you basically take the DINO vision transformer (ViT), which outputs image "features", and combine it with a new "image-to-triplane" encoder that takes the original image and the "feature" encodings from DINO and makes a triplane.
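To illustrate the triplane idea: to get the features for any 3D point, you project it onto the three planes, look up the features at each projection, and combine them. This is a generic sketch of the technique with made-up feature sizes, not TripoSR's code.

```python
# Generic illustration of querying a triplane representation.
import numpy as np

def sample_plane(plane, u, v):
    # Nearest-neighbour lookup for simplicity; real systems interpolate.
    res = plane.shape[0]
    i = int((u * 0.5 + 0.5) * (res - 1))
    j = int((v * 0.5 + 0.5) * (res - 1))
    return plane[i, j]

def query_triplane(planes, point):
    # planes: three (res, res, C) feature grids; point: (x, y, z) in [-1, 1].
    x, y, z = point
    feat = (sample_plane(planes["xy"], x, y)
            + sample_plane(planes["xz"], x, z)
            + sample_plane(planes["yz"], y, z))
    # A small decoder (an MLP in practice) would turn 'feat' into the
    # density and colour used for NeRF-style volume rendering.
    return feat

planes = {k: np.random.rand(64, 64, 32) for k in ("xy", "xz", "yz")}
features = query_triplane(planes, (0.1, -0.3, 0.7))
```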

The generative adversarial network (GAN) comes into the picture because it was trained to turn 3D triplanes into neural radiance fields.

The great selling point of this system is that the whole process works without using the image you provide as a "training" image and going through the training process (backpropagation, gradient descent, and all that). In other words, when you put in your image, everything happens at "inference" time, with the network operating in feedforward-only mode. As such it can generate 3D models quickly and on only one GPU.

My guess is a big part of what makes this possible is the vast amount of "world knowledge" of what objects are likely to look like in 3D that is the result of the massive "self-supervised" DINO vision transformer (ViT) model.

Introducing TripoSR: Fast 3D object generation from single images

#solidstatelife #computervision

waynerad@diasp.org

The Velo AI smart bike light.

"In moving from radar to a camera-based solution, the aim was to create a device that could tell the cyclist 'a lot more about what's going on in the world and do a lot of things that the radar can't do'. The device would need to help cyclists by providing situational awareness and alerts about nearby vehicles. This includes the ability to distinguish, using computer vision algorithms, between different vehicle types, as well as to estimate their relative speed and to identify and predict driver behaviour."

"Raspberry Pi Compute Module 4 is in effect the brain of the Copilot, aided by a custom Hailo AI co-processor to run the neural networks required for the device's computer vision. A fixed-lens Arducam camera is used to record video footage."

"The Copilot is supplied with a mount to fix it to a bike's seat post or saddle rail, with the camera facing rearward. The AI analyses the live video footage and, depending on the type of driver behaviour detected, custom alerts may be triggered -- audible for the cyclist, and flashing LED light patterns to alert the driver behind."

The article doesn't mention it, but "vélo" means bike in French.

Velo AI smart bike light - Raspberry Pi

#solidstatelife #ai #computervision #edgecomputing #raspberrypi

waynerad@diasp.org

"Sora AI: When progress is a bad thing."

This guy did experiments where he asked people to pick which art was AI generated and which art was human made. They couldn't tell the difference. Almost nobody could tell the difference.

To be sure, and "just to mess with people", he would tell people AI-generated art was made by humans and human art was made by AI and ask people to tell him how they could tell. People would proceed to tell him all the reasons why an AI-generated art piece was an amazing masterpiece clearly crafted by human hands -- with emotions and feelings. And when shown art made by a human and told it was AI-generated, people would write out a paragraph describing to me all the reasons how they could clearly tell why this was generated by AI.

That's pretty interesting, but actually not the point of this video. The point of the video is that AI art generators don't give people the same level of control as art they make themselves, but the AI clearly has an understanding of, for example, what a road is and what a car is, and a basic understanding of physics and of cause and effect.

He thinks we're very close to being able to take a storyboard and "shove it into the AI and it just comes up with the perfect 3D model based on the sketch, comes up with the skeletal mesh, comes up with the animations, infers details of the house based on your terrible drawings, manages the camera angles, creates the light sources, gives you access to all the key framing data and positions of each object within the scene, and with just a few tweaks you'd have a finished product. The ad would be done in like an hour or two, something that ..."

He's talking about the "Duck Tea" example in the video -- he made up a product called "Duck Tea" that doesn't exist and pondered what would be involved in making an ad for it.

"Would have taken weeks of planning and work, something that would have taken a full team a long time to finish, would take one guy one afternoon."

The solution: Vote for Michelle Obama because she will introduce Universal Basic Income?

Sora AI: When progress is a bad thing - KnowledgeHusk

#solidstatelife #ai #genai #diffusionmodels #computervision #aiethics

waynerad@diasp.org

So the claim is being made now that you can take any image -- a photo you've just taken on your phone, a sketch that you or your child just drew, or an image you generated using, say, Midjourney or DALL-E 3 -- and hand it to an AI model called Genie that will take the image and make it "interactive". You can control the main character and the scene will change around it. A tortoise made of glass, or maybe a translucent jellyfish floating through a post-apocalyptic cityscape.

"I can't help but point out the speed with which many of us are now becoming accustomed to new announcements and how we're adjusting to them."

"OpenAI Sora model has been out for just over a week and here's a paper where we can imagine it being interactive."

The Genie model is a vision transformer (ViT) model. That means it incorporates the "attention mechanism" we call "transformers" in its neural circuitry. That doesn't necessarily mean it "tokenizes" video, like the Sora model, but it does that, too. It also uses a particular variation of the transformer called the "ST-transformer" that is supposed to be more efficient for video. They don't say what the "ST" stands for but I'm guessing it stands for "spatial-temporal". It contains neural network layers that are dedicated to either spatial or temporal attention processing. This "ST" vision transformer was key to the creation of the video tokenizer, as what they did to create the tokenizer was use a "spatial-only" tokenizer (something called VQ-VAE) and modified it to do "spatial-temporal" tokenization. (They call their tokenizer ST-ViViT.)

(VQ-VAE, if you care to know, stands for "Vector Quantised Variational AutoEncoder". The term "autoencoder" means a combination of encoder and corresponding decoder. The "variational" part means the encoding in the middle is considered a latent "variable" that is designed to adhere to a predetermined statistical distribution. The "vector quantized" part means the vectors that come out are discrete, rather than continuous. I don't know how being discrete is advantageous in this context.)
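For what it's worth, the "vector quantized" part can be shown in a tiny snippet: the encoder's continuous output gets snapped to the nearest entry in a learned codebook, and the index of that entry is the discrete token. One common reason for making tokens discrete is that the downstream model (here, the dynamics model) can then be trained like a language model over a finite vocabulary. This is a generic illustration, not Genie's actual code:

```python
# Generic illustration of the vector-quantization step in a VQ-VAE:
# snap each continuous latent vector to its nearest codebook entry.
import numpy as np

def vector_quantize(latents, codebook):
    # latents: (N, D) encoder outputs; codebook: (K, D) learned embeddings.
    distances = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = distances.argmin(axis=1)       # discrete token ids, shape (N,)
    return tokens, codebook[tokens]         # ids plus the quantized vectors

codebook = np.random.rand(512, 64)          # K=512 codes of dimension 64
latents = np.random.rand(10, 64)            # 10 latent vectors from the encoder
tokens, quantized = vector_quantize(latents, codebook)
```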

After this there are two more neural network models. One of them takes the original frames, and one takes the video tokens and the output from the first model.

The first model is called the "latent action model". It takes video frames as input. Remember "latent" is just another word for "hidden". This is a neural network that is trained by watching videos all day. As it watches videos, it is challenged to predict later frames of video from previous frames that came before. In the process, it is asked to generate some parameters that describe what is being predicted. These are called the "latent actions". The idea is if you are given a video frame and the corresponding "latent actions", you can predict the next frames.

The second model is called the "dynamics" model. It takes tokens and the "latent actions" from the first model, and outputs video tokens.

Once all these models are trained up -- the tokenizer, the latent action model, and the dynamics model -- you're ready to interact.

You put in a photo of a tortoise made of glass, and now you can control it like a video game character.

The image you input serves as the initial frame. It gets tokenized, and everything is tokenized from that point onward. The system can generate new video in a manner analogous to how a large language model generates new text by outputting text tokens. The key, though, is that by using the keyboard to initiate actions, you're inputting actions directly into the "latent actions" parameters. Doing so alters the video tokens that get generated, which alters all the subsequent video after that.
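Putting the pieces together, the interactive loop might look roughly like this. This is a hedged sketch with made-up class and function names (DeepMind hasn't released code); the point is just how the user's keypresses stand in for the latent actions:

```python
# Hypothetical sketch of a Genie-style interactive loop; all names are stand-ins.
import numpy as np

class VideoTokenizer:            # "ST-ViViT"-style spatiotemporal tokenizer
    def encode(self, frames):    return np.zeros((len(frames), 16), dtype=int)
    def decode(self, tokens):    return [np.zeros((64, 64, 3)) for _ in tokens]

class DynamicsModel:
    def next_tokens(self, tokens, latent_action):
        # Predicts the next frame's tokens from past tokens plus the action.
        return np.zeros((1, 16), dtype=int)

def key_to_latent_action(key):
    # Hypothetical mapping from a small discrete action set to latent codes.
    return {"left": 0, "right": 1, "jump": 2}.get(key, 3)

def interactive_rollout(image, keyboard_actions):
    tokenizer, dynamics = VideoTokenizer(), DynamicsModel()
    tokens = tokenizer.encode([image])        # the input image is frame 0
    frames = [image]
    for key in keyboard_actions:
        # During training, latent actions are inferred from video; during
        # interaction, the user's keypress is mapped onto a latent action.
        latent_action = key_to_latent_action(key)
        tokens = np.concatenate([tokens, dynamics.next_tokens(tokens, latent_action)])
        frames.append(tokenizer.decode(tokens[-1:])[0])
    return frames

frames = interactive_rollout(np.zeros((64, 64, 3)), ["right", "right", "jump"])
```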

The researchers trained it on videos of 2D platformer games.

The AI "Genie" is out + humanoid robotics step closer

#solidstatelife #ai #genai #computervision

waynerad@diasp.org

AIflixhub aims to be your AI-generated movie platform. Commercial product with free tier. They say you can upload your existing assets, such as video clips, dialogue, sound effects, and music tracks. The system can combine these, and generate more, to produce your movie masterpiece. They say their AI tools can craft scripts, generate imagery, synthesize videos, create spoken dialogue, design sound effects, and compose soundtracks.

AIflixhub - AI-generated movie platform

#solidstatelife #ai #genai #computervision

waynerad@diasp.org

The claim is being made that a scientific research paper in which every figure was AI-generated passed peer review.

Article published a couple of days ago. Every figure in the article is AI generated and totally incomprehensible. This passed "peer-review"

#solidstatelife #ai #genai #computervision #deepfakes

waynerad@diasp.org

"Subprime Intelligence". Edward Zitron makes the case that: "We are rapidly approaching the top of generative AI's S-curve, where after a period of rapid growth things begin to slow down dramatically".

"Even in OpenAI's own hand-picked Sora outputs you'll find weird little things that shatter the illusion, where a woman's legs awkwardly shuffle then somehow switch sides as she walks (30 seconds) or blobs of people merge into each other."

"Sora's outputs can mimic real-life objects in a genuinely chilling way, but its outputs -- like DALL-E, like ChatGPT -- are marred by the fact that these models do not actually know anything. They do not know how many arms a monkey has, as these models do not 'know' anything. Sora generates responses based on the data that it has been trained upon, which results in content that is reality-adjacent."

"Generative AI's greatest threat is that it is capable of creating a certain kind of bland, generic content very quickly and cheaply."

I don't know. On the one hand, we've seen rapid bursts of progress in other technologies, only to be followed by periods of diminishing returns, sometimes for a long time, before some breakthrough leads to the next rapid burst of advancement. On the other hand, the number of parameters in these models is much smaller than the number of synapses in the brain, which might be an approximate point of comparison, so it seems plausible that continuing to make them bigger will in fact make them smarter and make the kind of complaints you see in this article go away.

What do you all think? Are we experiencing a temporary burst of progress soon to be followed by a period of diminishing returns? Or should we expect ongoing progress indefinitely?

Subprime Intelligence

#solidstatelife #ai #genai #llms #computervision #mooreslaw #exponentialgrowth

waynerad@diasp.org

Problem solving across 100,633 lines of code in Google Gemini 1.5 Pro.

The code is for generating some animations.

"What controls the animations on the littlest Tokyo demo?"

The model finds the demo and explains the animations are embedded within a glTF model. The video doesn't explain what glTF is -- apparently it stands for "GL Transmission Format", where "GL" in turn stands for "graphics library", as it does in "OpenGL".

"Show me some code to add a slider to control the speed of the animation. Use that kind of GUI the other demos have."

They show the code and the slider, which gets added to the scene and works.

Next, they give it a screenshot of a demo and ask where the code for it is.

There were a couple hundred demos in the system (they never say exactly how many) and it correctly finds the one that matches the image.

"How can I modify the code to make the terrain flatter?"

Gemini finds the function that generates the height and the exact line within the function to modify. It also provided an explanation of why the change worked.

For the last task they show, they use a 3D text demo that says "three.js".

"How can I change the text to say, 'goldfish' and make the mesh materials look really shiny and metallic?"

Gemini finds the correct demo and shows the precise lines in it to change, along with an explanation of how to change material properties such as metalness and roughness to get a shiny effect.

Problem solving across 100,633 lines of code | Gemini 1.5 Pro demo - Google

#solidstatelife #ai #genai #computervision #llms #multimodal #google #gemini

waynerad@diasp.org

Reaction video to OpenAI Sora, OpenAI's system for generating video from text.

I encountered the reaction video first, in fact I discovered Sora exists from seeing the reaction video, but see below for the official announcement from OpenAI.

It's actually kind of interesting and amusing comparing the guesses in the reaction videos about how the system works with the way it actually works. People are guessing based on their knowledge of traditional computer graphics and 3D modeling. However...

The way Sora works is quite fascinating. We don't know the nitty-gritty details, but OpenAI has described the system at a high level.

Basically it combines ideas from their image generation and large language model systems.

Their image generation systems, DALL-E 2 and DALL-E 3, are diffusion models. Their large language models, GPT-2, GPT-3, GPT-4, GPT-4-Vision, etc, are transformer models. (In fact "GPT" stands for "generative pretrained transformer").

I haven't seen diffusion and transformer models combined before.

Diffusion models work by having a set of parameters in what they call "latent space" that describe the "meaning" of the image. The word "latent" is another way of saying "hidden". The "latent space" parameters are "hidden" inside the model but they are created in such a way that the images and text descriptions are correlated, which is what makes it possible to type in a text prompt and get an image out. I've elsewhere given high-level hand-wavey descriptions of how the latent space parameters are turned into images through the diffusion process, and how the text and images are correlated (a training method called CLIP), so I won't repeat that here.

Large language models, on the other hand, work by turning words and word pieces into "tokens". The "tokens" are vectors constructed in such a way that the numerical values in the vectors are related to the underlying meaning of the words.

To make a model that combines both of these ideas, they figured out a way of doing something analogous to "tokens" but for video. They call their video "tokens" "patches". So Sora works with visual "patches".

One way to think of "patches" is as video compression both spatially and temporally. Unlike a video compression algorithm such as mpeg that does this using pre-determined mathematical formulas (discrete Fourier transforms and such), in this system the "compression" process is learned and is all made of neural networks.

So with a large language model, you type in text and it outputs tokens which represent text, which are decoded to text for you. With Sora, you type in text and it outputs tokens, except here the tokens represent visual "patches", and the decoder turns the visual "patches" into pixels for you to view.
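As a toy illustration of the "patches" idea, here is how you might cut a raw video tensor into spacetime patches before a learned encoder projects them into tokens. The sizes are arbitrary and this is my own illustration, not OpenAI's code:

```python
# Toy illustration of cutting a video into spacetime "patches".
import numpy as np

def extract_spacetime_patches(video, t=4, p=16):
    # video: (frames, height, width, channels)
    f, h, w, c = video.shape
    patches = (video[:f - f % t, :h - h % p, :w - w % p]
               .reshape(f // t, t, h // p, p, w // p, p, c)
               .transpose(0, 2, 4, 1, 3, 5, 6)
               .reshape(-1, t * p * p * c))
    # In a real system each flattened patch would then be projected into a
    # learned latent space by an encoder network, not used raw like this.
    return patches

video = np.random.rand(16, 128, 128, 3)
tokens = extract_spacetime_patches(video)   # shape (256, 3072)
```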

Because the "compression" works both ways, in addition to "decoding" patches to get pixels, you can also input pixels and "encode" them into patches. This enables Sora to input video and perform a wide range of video editing tasks. It can create perfectly looping video, it can animate static images (why no Mona Lisa examples, though?), it can extend videos, either forward or backward in time. Sora can gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. I found these to be the most freakishly fascinating examples on their page of sample videos.

They list the following "emerging simulation capabilities":

"3D consistency." "Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space."

This is where they have the scene everyone is reacting to in the reaction videos, where the couple is walking down the street in Japan with the cherry blossoms.

By the way, I was wondering what kind of name is "Sora" so I looked it up on behindthename.com. It says there are two Japanese kanji characters both pronounced "sora" and both of which mean "sky".

"Long-range coherence and object permanence." "For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video."

"Interacting with the world." "Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks."

"Simulating digital worlds." "Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity."

However they say, "Sora currently exhibits numerous limitations as a simulator." "For example, it does not accurately model the physics of many basic interactions, like glass shattering."

This is incredible - ThePrimeTime

#solidstatelife #ai #genai #diffusionmodels #gpt #llms #computervision #videogeneration #openai

waynerad@diasp.org

"Comic Translate." "Many Automatic Manga Translators exist. Very few properly support comics of other kinds in other languages. This project was created to utilize the ability of GPT-4 and translate comics from all over the world. Currently, it supports translating to and from English, Korean, Japanese, French, Simplified Chinese, Traditional Chinese, Russian, German, Dutch, Spanish and Italian."

For a couple dozen languages, the best Machine Translator is not Google Translate, Papago or even DeepL, but GPT-4, and by far. This is very apparent for distant language pairs (Korean<->English, Japanese<->English etc) where other translators still often devolve into gibberish.

Works by combining neural networks for speech bubble detection, text segmentation, OCR, inpainting, translation, and text rendering. The neural networks for speech bubble detection, text segmentation, and inpainting apply to all languages, while OCR, translation, and text rendering are language-specific.
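Roughly, that staged pipeline might be wired together like this. The functions are placeholders standing in for the individual models, not Comic Translate's actual API:

```python
# Hypothetical sketch of the staged comic-translation pipeline described above.
# Each stub stands in for a separate model.

def detect_speech_bubbles(page):            return [{"box": (0, 0, 100, 50)}]
def segment_text(page, bubble):             return "text-mask"
def run_ocr(page, mask, lang):              return "source text"
def inpaint(page, mask):                    return page          # erase the original text
def translate(text, src, dst):              return "translated"  # e.g. via GPT-4
def render_text(page, bubble, text, lang):  return page

def translate_page(page, source_lang="ja", target_lang="en"):
    # Bubble detection, segmentation, and inpainting are language-agnostic;
    # OCR, translation, and rendering depend on the language pair.
    for bubble in detect_speech_bubbles(page):
        mask = segment_text(page, bubble)
        original = run_ocr(page, mask, source_lang)
        page = inpaint(page, mask)
        page = render_text(page, bubble,
                           translate(original, source_lang, target_lang),
                           target_lang)
    return page

translated_page = translate_page("page.png")
```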

Comic Translate

#solidstatelife #ai #computervision #mt #genai #gpt #manga #anime

waynerad@diasp.org

CES 2024: "Looking into the future".

"They are putting AI in everything."

AI pillow, AI mattress, AI office chair, AI fridges, AI washers & dryers, AI smart lamps, AI grills, AI barbecue, AI cooking gear, AI pressure cookers, AI food processors, AI air fryers, AI stethoscopes, AI bird feeders, AI telescopes, AI backpacks, AI upscaling TVs, AI realtime language translator. Most require an internet connection to work.

Manufacturing and warehouse robots, delivery robots, lawn care robots, lawnmower robots, pool cleaning robots, robot bartender, robot barista, robots that cook stir fry and make you ice cream, robot front desk assistant, hospital robot, robot to throw tennis balls for your dog and feed your dog, robot that can roll around your house and project things on the wall.

Computer vision self-checkout without scanning barcodes, computer vision food tray scanner that tells you how many calories are in the food, whether it has any allergens, and other stuff to do with the food, computer vision for vehicles.

Augmented reality form factor that is just regular glasses, 3D video without glasses, VR roller coaster haptic suits.

A car with all 4 wheels able to move independently so it can rotate in place and move sideways for parallel parking.

A 1-person autonomous helicopter with no steering mechanism, and a car with drone propellers on the roof that can fold into the car.

Giant LED walls everywhere, a ride where people sit on a seat hanging from the ceiling while moving through a world that's all on a giant LED screen.

Transparent TVs -- possibly great for storefronts.

A water maker that pulls moisture from the air, home beer making, an automated manicure system, and a mouth mask that lets your friends hear you in a video game while nobody can hear you in real life.

Crazy AI tech everywhere (the CES 2024 experience) - Matt Wolfe

#solidstatelife #ai #genai #robotics #virtualreality #augmentedreality #computervision #ces2024