How "top heavy" is YouTube? 65% of videos have fewer than 100 views. 87% have fewer than 1,000 views. Only 3.7% of videos exceed 10,000 views, which is the threshold for monetization. Those 3.7% of videos get 94% of views. The top 0.16% of videos get 50% of video views.

In other words, video views on YouTube follow a power law distribution, as you might have expected, but it's a lot steeper than you might have expected.
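To make "top heavy" concrete, here is a minimal sketch of how a figure like "the top X% of videos get Y% of views" can be computed from a list of view counts. The function name `top_share` and the toy numbers are illustrative assumptions, not the researchers' code or data.

```python
def top_share(views: list[int], top_fraction: float) -> float:
    """Fraction of all views captured by the most-viewed `top_fraction` of videos."""
    if not views:
        raise ValueError("views must not be empty")
    ranked = sorted(views, reverse=True)
    # Always count at least one video so tiny samples still give an answer.
    k = max(1, round(top_fraction * len(ranked)))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Toy data (made up for illustration): one hit and four obscure videos.
toy_views = [90, 5, 3, 1, 1]
print(top_share(toy_views, 0.20))
```

The steeper the power law, the closer this number gets to 1 for a small `top_fraction`.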

How was this figured out? Using a new but simple technique called "dialing for videos".

You may not realize it, but those YouTube IDs that look like a jumble of letters and numbers, like "A-SyeJaMMjI", are actually numbers. Yes, all YouTube video IDs are actually numbers. They're just not written in base 10. They're 64-bit numbers written in base 64. If you're wondering how YouTube came up with 64 digits, think about it: digits 0-9 give you 10, then lowercase letters a-z give you 26 more, bringing you up to 36, then uppercase letters A-Z give you 26 more, getting you up to 62. You still need 2 more to reach 64, and for those YouTube chose the dash ("-") and the underscore ("_").
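As a rough illustration of "IDs are numbers", the sketch below turns an 11-character ID into an integer. It assumes the ID is standard URL-safe Base64 of 8 bytes. That is an assumption on my part: the description above says which 64 characters are used, not what order they count in. The function name `video_id_to_int` is made up for this example.

```python
import base64


def video_id_to_int(video_id: str) -> int:
    """Decode an 11-character ID into a 64-bit integer.

    Assumption: the ID uses the standard URL-safe Base64 alphabet
    (A-Z, a-z, 0-9, "-", "_") and encodes exactly 8 bytes.
    """
    if len(video_id) != 11:
        raise ValueError("expected an 11-character ID")
    # 11 characters carry 66 bits; one "=" pads the string to a length
    # Base64 accepts, and the decoder keeps the first 64 bits (8 bytes).
    raw = base64.urlsafe_b64decode(video_id + "=")
    return int.from_bytes(raw, "big")


print(video_id_to_int("A-SyeJaMMjI"))
```

Note that 11 base-64 digits hold 66 bits, slightly more than the 64 needed, so under this assumption not every 11-character string is a distinct number.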

Because the numbers are randomly chosen 64-bit numbers, there are 2^64 possibilities, which in decimal is 18,446,744,073,709,551,616. That's much too large to try every number, or even to find videos by trying numbers at random. But the researchers discovered a quirk. Through the YouTube API they could do searches, and YouTube would do the search in a case-insensitive way. Well, except not for the last character for some reason. And it would allow 32 IDs to be searched on in the same query. So the researchers were able to find 10,000 videos (well, 10,016 actually) by doing millions of searches. This collection of 10,000 videos is likely to be more representative of all of YouTube than any other sample academic researchers have ever had. All previous attempts have resulted in biased results because they were influenced either by the recommendation system, personalized search results, or just whatever secret algorithms YouTube has that determine how it ranks the videos it enables you to find.
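To see why the case-insensitivity quirk matters, here is a small sketch that lists every ID a single guess would match if all characters except the last are compared without regard to case. The helper name `case_variants` is hypothetical, and this only models the matching rule as described above, not the actual API calls the researchers made.

```python
from itertools import product


def case_variants(video_id: str) -> list[str]:
    """All IDs equal to `video_id` when every character except the last
    is compared case-insensitively (the last character must match exactly)."""
    head, last = video_id[:-1], video_id[-1]
    # Letters contribute two options (lower and upper); digits, "-" and "_" one.
    options = [
        sorted({ch.lower(), ch.upper()}) if ch.isalpha() else [ch]
        for ch in head
    ]
    return ["".join(chars) + last for chars in product(*options)]


variants = case_variants("A-SyeJaMMjI")
print(len(variants))  # 9 letters among the first 10 characters -> 2**9 = 512
```

Each letter in the first ten positions doubles the number of IDs one guess covers, so a single search can stand in for hundreds of distinct IDs.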

How big is YouTube? Their estimate is 9.8 billion videos. Or at least that's how big it was between October 5, 2022, and December 13, 2022, which is when they did their data collection. Their paper was finally published last December.
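The post doesn't spell out how a sample turns into a size estimate, but one plausible reading is simple proportion: if you effectively check some number of IDs out of 2^64 and a certain number turn out to be real videos, scale the hit rate up to the whole ID space. The sketch below shows that arithmetic. The function names are hypothetical, and the number of IDs the researchers effectively covered is not given here, so the example runs the calculation backwards from the figures that are given.

```python
ID_SPACE = 2 ** 64  # number of possible 64-bit IDs


def estimate_total_videos(hits: int, ids_covered: int) -> float:
    """Scale the observed hit rate up to the full ID space."""
    return hits / ids_covered * ID_SPACE


def ids_needed(expected_hits: int, total_videos: float) -> float:
    """How many IDs must be covered to expect `expected_hits` real videos."""
    return expected_hits * ID_SPACE / total_videos


# Using the post's figures: 10,016 videos found, 9.8 billion videos estimated.
print(f"{ids_needed(10_016, 9.8e9):.3e}")  # roughly 1.9e13 IDs
```

Under these assumptions only about one ID in two billion is a real video, which is why plain random guessing without the search quirk would be hopeless.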

By looking at what percentage of their sample were uploaded in any given year, they can chart the growth of YouTube:

Year - Percentage of sample
2005 - 0.00%
2006 - 0.05%
2007 - 0.22%
2008 - 0.43%
2009 - 0.74%
2010 - 1.13%
2011 - 1.67%
2012 - 1.86%
2013 - 1.97%
2014 - 2.34%
2015 - 3.02%
2016 - 4.25%
2017 - 5.39%
2018 - 6.73%
2019 - 8.81%
2020 - 15.22%
2021 - 20.29%
2022 - 25.91%

Translating those numbers into cumulative totals, in millions of videos on YouTube by the end of each year (remember, a thousand million is a billion), we get this list:

2005 - 0
2006 - 5
2007 - 27
2008 - 69
2009 - 142
2010 - 254
2011 - 418
2012 - 602
2013 - 796
2014 - 1,027
2015 - 1,325
2016 - 1,745
2017 - 2,278
2018 - 2,943
2019 - 3,813
2020 - 5,316
2021 - 7,321
2022 - 9,881
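The conversion from the percentage list to the millions list is a running sum scaled by the total. Here is a minimal sketch of that arithmetic. It assumes a total of 9,881 million (taken from the last entry of the list) and rescales so the percentages, which add up to slightly more than 100 because of rounding, end exactly on that total. The results should land within a few million of the list, not match it digit for digit.

```python
from itertools import accumulate

# Share of the sample uploaded in each year, in percent (from the list above).
PERCENT_BY_YEAR = {
    2005: 0.00, 2006: 0.05, 2007: 0.22, 2008: 0.43, 2009: 0.74, 2010: 1.13,
    2011: 1.67, 2012: 1.86, 2013: 1.97, 2014: 2.34, 2015: 3.02, 2016: 4.25,
    2017: 5.39, 2018: 6.73, 2019: 8.81, 2020: 15.22, 2021: 20.29, 2022: 25.91,
}


def cumulative_millions(percent_by_year: dict[int, float],
                        total_millions: float) -> dict[int, float]:
    """Running total of videos (in millions) at the end of each year."""
    years = sorted(percent_by_year)
    running = list(accumulate(percent_by_year[y] for y in years))
    # Rescale so the final year equals the assumed total despite rounding.
    scale = total_millions / running[-1]
    return {year: value * scale for year, value in zip(years, running)}


for year, millions in cumulative_millions(PERCENT_BY_YEAR, 9_881).items():
    print(f"{year} - {millions:,.0f}")
```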

73% of videos had no comments. 1.04% of videos had 100 comments or more, and those accounted for 55% of all comments in the sample.

"Likes" are evn more skewed, with 0.08% of videos getting 55% of likes.

YouTube disabled the "Dislike" buttons in 2021.

Most channels had at least one subscriber and the average was 65. Subscriber counts, while less "top heavy", turned out to be weakly correlated with views. The researchers estimate 70% of views of any given video come from algorithms and not from subscribers or external links pointing to a video.

Mean video length was 615 seconds (10 minutes, 15 seconds). 6.2% were 10 seconds or less, 38% were 1 minute or less, 82% were ten minutes or less, and only 3.9% were an hour or more.

Words that occurred most in metadata tags included "Sony" and "Playstation".

The researchers employed hand-coders to hand-code a subsample of 1,000 videos. They found only 3% of videos had anything to do with news, politics, or current events. 3.8% had anything to do with religion. 15.5% had just still images for the video part. (I actually see a lot of music videos like this -- just an album cover or photo of the artist and the rest is audio.) 19.5% were streams of video games. 8.4% were computer-generated but not a video game. 14.3% had a background design indicating they were produced on some sort of set. 84.3% were edited. 36.7% had text or graphics overlaid on the video. 35.7% were recorded indoors. 18.1% were recorded outdoors. (The remainder were both or unclear.) Cameras were "shaky" 52.3% of the time. A human was seen talking to the camera 18.3% of the time. 9.1% of videos recorded a public event. The video was something obviously not owned by the uploader, such as a movie clip, 4.8% of the time.

Sponsorships and "calls to action" were only present in 3.8% of videos.

96.8% of videos had audio. 40.5% were deemed by coders to be entirely or almost entirely music. Many of these were backgrounds for performances, video game footage, or slide shows.

53.8% had spoken language. 28.9% had spoken language on top of music.

For languages, "we built our own language detection pipeline by running each video's audio file using the VoxLingua107 ECAPA-TDNN spoken language recognition model."

Language distribution was:

English: 20.1%
Hindi: 7.6%
Spanish: 6.2%
Welsh: 5.7%
Portuguese: 4.9%
Latin: 4.6%
Russian: 4.2%
Arabic: 3.3%
Javanese: 3.3%
Waray: 3.2%
Japanese: 2.2%
Indonesian: 2.0%
French: 1.8%
Icelandic: 1.7%
Urdu: 1.5%
Sindhi: 1.4%
Bengali: 1.4%
Thai: 1.2%
Turkish: 1.2%
Central Khmer: 1.1%

"It is unlikely that Welsh is the fourth most common language on YouTube, for example, or that Icelandic is spoken more often than Urdu, Bengali, or Turkish. More startling still is that, according to this analysis Latin is not a 'dead language' but rather the sixth most common language spoken on YouTube. Of the top 20, Welsh, Latin, Waray-Waray, and Icelandic are not in the top 200 most spoken languages, and Sindhi and Central Khmer are not in the top 50 (Ethnologue, 2022). The VoxLingua107 documentation notes a number of languages which are commonly mistaken for another (Urdu for Hindi, Spanish for Galician, Norwegian for Nynorsk, Dutch for Afrikaans, English for Welsh, and Estonian for Finnish), but does not account for the other unusual results we have seen. We thought that some of the errors may be because of the amount of music in our sample, but removing the videos that are part of YouTube Music (which does not include all music) did not yield significantly different results."

"It is worth highlighting just how many of the most popular languages are not among the languages available in the YouTube autocaptioning system: Hindi, Arabic, Javanese, Waray-Waray, Urdu, Thai, Bengali, and Sindhi."

Dialing for videos: A random sample of YouTube

#solidstatelife #discoveries #winnertakeall #powerlawdistribution #youtube