Tesla AI Day. Yeah, I know, lots of you have already seen the video. So I guess this is for the 3 people who haven't yet.
They think of their AI system as being analogous to the "visual cortex" in biological organisms. The problem they have is fusing the input from multiple cameras. A Tesla car has 8 cameras, which are high dynamic range (HDR) cameras with 1280x960 resolution that operate at 36 frames per second.
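To get a feel for how much raw input that is, here's a quick back-of-the-envelope calculation using just the numbers above. The function name is mine, and it only counts pixels; it assumes nothing about bit depth or compression.

```python
# Raw pixel throughput implied by the camera specs quoted above.
CAMERAS = 8
WIDTH, HEIGHT = 1280, 960
FPS = 36


def pixels_per_second(cameras=CAMERAS, width=WIDTH, height=HEIGHT, fps=FPS):
    """Total pixels arriving per second across all cameras."""
    return cameras * width * height * fps


if __name__ == "__main__":
    print(pixels_per_second())  # 353894400, i.e. roughly 354 million pixels/s
```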
Their solution is to have the neural networks that process the camera images output what they see as 3D vectors, in what they call "vector space". They can visualize this 3D "vector space" representation on a screen.
The processing first goes through residual networks (resnets), which are convolutional neural networks, but the "residual" technique allows them to go much deeper than traditional convolutional networks. They like the fact that they can make the network deeper or shallower as they please to trade off vision accuracy against latency.
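Here's a minimal sketch of the "residual" idea, as I understand it. It is not Tesla's network: real resnets use convolutions, and I'm using tiny linear layers as stand-ins. All the function names are made up for illustration. The key line is the one that adds the input back onto the block's output, and the `stack` function shows how depth is just "how many blocks you chain".

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(v, w1, w2):
    """Compute relu(v + F(v)), where F is two small linear layers."""
    hidden = relu(linear(w1, v))
    f = linear(w2, hidden)
    # The skip connection: the input is added back to the block's output.
    return relu([a + b for a, b in zip(v, f)])


def stack(v, blocks):
    """Chain any number of residual blocks; depth is len(blocks)."""
    for w1, w2 in blocks:
        v = residual_block(v, w1, w2)
    return v


if __name__ == "__main__":
    zero = [[0.0, 0.0], [0.0, 0.0]]
    # 50 blocks deep, and the (non-negative) input still passes through intact.
    print(stack([1.0, 2.0], [(zero, zero)] * 50))
```

With all-zero weights each block just passes its input along, which is roughly why stacking many of them doesn't wreck the signal the way a plain deep stack can.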
After the resnets, the data goes into something called a BiFPN, which stands for Bi-directional Feature Pyramid Network. They don't say much about what this network outputs, other than that it is "features", not images.
After this the data branches into multiple "heads". Each of the branches does something different: object detection, traffic lights, lane prediction, etc.
After this, they do something called "rectification", which takes the vector space output, takes into account each camera's position and orientation, and projects its output into the same 3D "vector space". The final fusion process uses a type of neural network called a transformer. These were originally invented for language translation and have an "attention" mechanism that enables the translation system to pay attention to different words in the input as it generates the output. Since then, "vision transformers" have been invented that enable the neural network to focus "attention" on a specific part of a scene. However, Tesla is not using standard vision transformers. They invented their own transformer which operates in "vector space". So it doesn't take images as its input, it takes sets of 3D vectors. What it outputs, at the end of the whole process, is a single unified 3D representation of the scene with curbs, lanes, traffic lights, other cars, pedestrians, and so on, identified.
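For anyone who hasn't seen an "attention" mechanism written down, here's a small sketch of the generic scaled dot-product version from the original transformer work. This is not Tesla's vector-space transformer, which they didn't show in code; the function names and the toy inputs are mine.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns (output, weights): the output is a weighted average of the
    values, weighted by how well each key matches the query.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(dim)]
    return output, weights


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[10.0, 0.0], [0.0, 10.0]]
    out, w = attention([5.0, 0.0], keys, values)
    print(w)    # most of the weight lands on the first key
    print(out)
```

The query here lines up with the first key, so the output is pulled mostly toward the first value. That "pay more attention to what matches" behavior is the whole trick.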
This system has another trick up its sleeve. Everything up to here is just looking at camera input at a single point in time. But they enabled the system to understand motion over time. This is done with two "cache" systems. One of them is simply time based -- it remembers the last few seconds of whatever the car has seen. The second is space based. So if, for example, the Tesla car sits at a red light, it can remember lane markings it saw many seconds ago because they are in the "space based" cache and it remembers the space it recently drove past or over.
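Here's how I picture the two caches, as a toy sketch. The class names, the 1-meter grid cells, and the string "features" are all my own assumptions; the talk didn't go into this level of detail. The time cache forgets by age, while the space cache is keyed by where the car was, so sitting still doesn't erase anything.

```python
from collections import deque


class TimeCache:
    """Keeps only the features seen in the last max_age_s seconds."""

    def __init__(self, max_age_s):
        self.max_age_s = max_age_s
        self.items = deque()  # (timestamp, features), oldest first

    def push(self, t, features):
        self.items.append((t, features))
        # Drop everything older than the time window.
        while self.items and t - self.items[0][0] > self.max_age_s:
            self.items.popleft()

    def recent(self):
        return [features for _, features in self.items]


class SpaceCache:
    """Keeps the latest features per grid cell, regardless of age."""

    def __init__(self, cell_m):
        self.cell_m = cell_m
        self.cells = {}

    def _key(self, x, y):
        return (int(x // self.cell_m), int(y // self.cell_m))

    def push(self, x, y, features):
        self.cells[self._key(x, y)] = features

    def lookup(self, x, y):
        return self.cells.get(self._key(x, y))


if __name__ == "__main__":
    tc, sc = TimeCache(max_age_s=3.0), SpaceCache(cell_m=1.0)
    tc.push(0.0, "lane markings")
    sc.push(12.3, 4.0, "lane markings")
    tc.push(30.0, "red light")   # waited 30 s at the light
    print(tc.recent())           # the lane markings have aged out
    print(sc.lookup(12.9, 4.5))  # but the space cache still has them
```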
These "caches" are combined with a recurrent neural network. This combination allows the system to keep track of the structure of the road over time, and the system handles remembering cars when they are temporarily occluded very well.
After all this, the data goes into the planning and control system. For this the presenter shows an example of changing lanes to make a left turn, and says the path planning system does 2,500 path searches in 1.5 milliseconds.
The planning system plans for everything in a scene, including other cars and pedestrians. He shows an example where the car is driving down a narrow street where we can pull aside and yield for another car or they can pull aside and yield for us. If the other car yields, our car knows what to do because it created that plan for the other car.
He shows a visualization of an A* search algorithm, explains that it is too computationally expensive, and says they are developing a neural network, borrowing the design from AlphaGo, to guide "Monte Carlo Tree Search", which AlphaGo also uses.
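If you haven't run into A* before, here's a bare-bones version on a tiny grid. This is the textbook algorithm with a Manhattan-distance heuristic, not whatever Tesla actually runs, and the grid format is something I made up. I have it return the number of nodes it expanded, since that count is what blows up on hard problems and what a learned heuristic would presumably try to shrink.

```python
import heapq


def astar(grid, start, goal):
    """A* on a 4-connected grid. '#' cells are blocked.

    Returns (cost, expanded): cost is the number of steps on a shortest
    path (None if unreachable), expanded is how many nodes were popped.
    """
    rows, cols = len(grid), len(grid[0])

    def h(p):
        # Manhattan distance: never overestimates on a 4-connected grid.
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    best = {start: 0}
    expanded = 0
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if g > best.get(node, float("inf")):
            continue  # stale heap entry; a cheaper route was found later
        expanded += 1
        if node == goal:
            return g, expanded
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                ng = g + 1
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc)))
    return None, expanded


if __name__ == "__main__":
    grid = ["....",
            ".##.",
            "...."]
    print(astar(grid, (0, 0), (2, 3)))
```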
You might be surprised that up until this point, the planning system does not use neural networks, but uses traditional computer science path planning algorithms. In the Q&A section, Elon Musk reveals that these are written in C++. He says neural networks shouldn't be used unless they have to be, and for vision they have to be, but since path planning doesn't have to be it's written in C++.
I would think this system would have trouble working in places with chaotic driving without clear rules, and the presenter acknowledges the system won't work in other places like India, where he himself happens to be from.
Next they talk about data set labelling. Originally they labeled images, but they switched to labeling in 3D vector space. They developed a UI where people can move things in vector space and see the projection in multiple photographs.
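The "label once in 3D, see it in every photo" idea comes down to projecting a 3D point into each camera. Here's a heavily simplified pinhole-camera sketch of that. I'm assuming cameras that all look straight down the +z axis with no rotation and no lens distortion, and the focal length and image-center numbers are made up, so treat it as an illustration only.

```python
def project(point, cam_pos, focal_px=1000.0, cx=640.0, cy=480.0):
    """Project a 3D point into pixel coordinates for one camera.

    Simplified pinhole model: the camera sits at cam_pos and looks down
    +z with no rotation. Returns None if the point is behind the camera.
    """
    x = point[0] - cam_pos[0]
    y = point[1] - cam_pos[1]
    z = point[2] - cam_pos[2]
    if z <= 0:
        return None
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def project_label(point, cameras):
    """Project one 3D label into every camera that can see it."""
    return {name: project(point, pos) for name, pos in cameras.items()}


if __name__ == "__main__":
    # Two hypothetical cameras, one meter apart side by side.
    cameras = {"left": (0.0, 0.0, 0.0), "right": (1.0, 0.0, 0.0)}
    print(project_label((1.0, 0.0, 10.0), cameras))
```

Move the 3D point and every camera's 2D projection moves with it, which is what makes labeling in vector space so much less work than labeling each image separately.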
He talks about an auto-labeling system, but I didn't really understand how it works. Apparently it can combine data from multiple cars and reconstruct the road surface and walls and other parts of the scene from the video from multiple cars going through the same place. It also does a good job handling occlusions of moving objects such as cars and pedestrians.
They went to the next level by creating a simulator. It makes pretty realistic video. Of course since the simulation is computer-generated the vector space can automatically be correctly labeled and produce massive amounts of training data. The simulation system even simulates the characteristics of the cameras in the cars, such as adding sensor noise and simulating the effect the sun has on the camera. Neural networks are used to enhance the images and make them look even more realistic.
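The "labels come for free" point is easy to show with a toy. In this sketch the "scene" is just a 1-D strip with one bright object on it, and the "camera" adds Gaussian noise. Everything here (the function names, the strip, the noise level) is invented for illustration and has nothing to do with Tesla's actual simulator. The point is that since the generator placed the object, it already knows the correct label.

```python
import random


def render_scene(rng, width=8):
    """Toy 'simulator': put one object on a 1-D strip of pixels.

    Returns (clean_image, label); the label is known exactly because
    the generator chose the position itself.
    """
    pos = rng.randrange(width)
    clean = [1.0 if i == pos else 0.0 for i in range(width)]
    return clean, pos


def add_sensor_noise(image, rng, sigma=0.05):
    """Add Gaussian noise to each pixel and clip back into [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in image]


def make_dataset(n, seed=0):
    """Generate n (noisy_image, label) pairs with no human labeling."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        clean, label = render_scene(rng)
        data.append((add_sensor_noise(clean, rng), label))
    return data


if __name__ == "__main__":
    for image, label in make_dataset(3):
        print(label, [round(p, 2) for p in image])
```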
The main purpose of the simulator, though, isn't just to create massive amounts of training data but to create lots of examples of accidents and other edge cases that occur infrequently in real life. Speeding police cars, and so on. Most of the environments are algorithmically created, not created by human artists, so there is a potentially unlimited number of roads to train from.
Before putting the models in cars, they do extensive testing, with 1 million evaluations/week on every code change. They developed their own debugging tools so you can see the outputs of multiple different revisions of the software side by side.
The rest of the talk is about Dojo, Tesla's upcoming supercomputer.
Basically what they did is create a supercomputer for learning how to drive. They start the process by designing a training node, which is a CPU combined with dedicated hardware for matrix operations (the core operations in any AI system), hardware for parallel floating point and integer math (similar to a DSP chip), SRAM, and communication hardware. The CPU has 4 threads and an instruction set designed specifically for machine learning (so it's not using a general instruction set such as x86 or ARM). 354 of these "training nodes" are manufactured on a single chip, called the D1 chip, with high-speed communication from each node to its adjacent nodes on 4 sides. It has 50 billion transistors on a single 645 square millimeter chip manufactured at 7 nm.
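Some quick arithmetic on those D1 numbers, taking 645 as the die area in square millimeters. The per-node figure is only an upper bound on my part, since it pretends every transistor on the die belongs to a training node, which ignores whatever goes into the interconnect and I/O.

```python
# Figures quoted in the talk for the D1 chip.
NODES_PER_D1 = 354
TRANSISTORS_PER_D1 = 50_000_000_000
DIE_AREA_MM2 = 645


def transistors_per_node():
    """Upper bound: assumes all transistors are inside training nodes."""
    return TRANSISTORS_PER_D1 / NODES_PER_D1


def transistors_per_mm2():
    """Average transistor density over the whole die."""
    return TRANSISTORS_PER_D1 / DIE_AREA_MM2


if __name__ == "__main__":
    print(round(transistors_per_node() / 1e6, 1), "million per node")
    print(round(transistors_per_mm2() / 1e6, 1), "million per mm^2")
```

That works out to about 141 million transistors per node at most, and about 77.5 million per square millimeter.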
With these D1 chips, the plan is to take 500,000 of the training nodes, spread across many D1 chips, and connect them with "Dojo interface processors", which in turn connect to outside computers. The D1 chips are organized into "training tiles". They created their own power supply and cooling systems for these "tiles". The tiles are placed in an "exapod" where 10 cabinets are combined and the walls removed so the tiles can communicate directly with each other without cabinet walls getting in the way.
They made their own compiler to compile PyTorch models and other code for the hardware.
Basically, they created a supercomputer specialized, from the transistors themselves on up, for one specific task, which is training vision neural networks.
#solidstatelife #ai #computervision #autonomousvehicles #tesla