15 min read
Over the last several months, I have wasted countless hours read through and collected online posts related to several conversational spikes that were triggered by current events. These conversational spikes contained multiple examples of outright misinformation and artificial amplification of this misinformation.
I published three writeups describing this analysis: one on a series of four spikes related to Ilhan Omar, a second related to the suicide of Jeffrey Epstein, and a third related to trolls and sockpuppets active in the conversation related to Tulsi Gabbard. For these analyses, I looked at approximately 2.7 million tweets, including the domains and YouTube videos shared.
Throughout each of these spikes, right leaning and far right web sites that specialize in false or misleading information were shared far more extensively than mainstream news sources. As shown in the writeups, there was nothing remotely close to balance in the sites shared. Rightwing sites and sources dominated the conversation, both in number of shares, and in number of domains shared.
This imbalance led me to return to a question I looked at back in 2017: is there a business model or a sustainability model for publishing misinformation and/or hate? This is a question multiple other people have asked; as one example, Buzzfeed has been on this beat for years now.
To begin to answer this question, I scanned a subset of the sites used when spreading or amplifying misinformation, along with several mainstream media sites. This scan had two immediate goals:
- get accurate information about the types of tracking and advertising technology used on each individual site; and
- observe overlaps in tracking technologies used across multiple sites.
Both mainstream news sites and misinformation sites rely on advertising to generate revenue.
The companies that sell ads collect information about people, the devices they use, and their geographic location (at minimum, inferred from IP addresses, but also captured via tracking scripts), as part of how they sell and deliver ads.
This scan will help us answer several questions:
- what companies help these web sites generate revenue?
- what do these adtech companies know about us?
- given what these companies know about us, how does that impact their potential complicity in spreading, supporting, or profiting from misinformation?
25 sites were scanned -- each site is listed below, followed by the number of third parties that were called on each site. The sites selected for scanning meet one or more of the following criteria: were used to amplify false or misleading narratives on social media; have a track record of posting false or misleading content; are recognized as a mainstream news site; are recognized as a partisan but legitimate web site.
Every site scan began by visiting the home page. From the home page, I followed a linked article. From the linked article, I followed a link to another article within the site, for a total of three pages in each site.
On each pageload, I allowed any banner ads to load, and then scrolled to the bottom of the page. A small number of the sites used "infinite scroll" - on these sites, I would scroll down the equivalent of approximately 3-4 screens before moving on to a new page in the site.
While visiting each site, I used OWASP ZAP (an intercepting proxy) to capture the web traffic and any third party calls. For each scan, I used a fresh browser with the browsing history, cookies, and offline files wiped clean.
The list of sites scanned are listed below, sorted in order of observed trackers, from low to high.
The sites at the top of the list shared information about site visitors with more third party domains. In general, each individual domain is a different company, although in some cases (like Google and Facebook) a single company can control multiple domains. This count is at the domain level, so if a site sent user information to subdomain1.foo.com and subdomain2.foo.com, the two distinct subdomains count as a single site.
- dailycaller (dot) com -- 189
- thegatewaypundit (dot) com -- 160
- thedailybeast (dot) com -- 154
- mediaite (dot) com -- 153
- dailymail.co.uk -- 151
- zerohedge (dot) com -- 145
- cnn (dot) com -- 143
- westernjournal (dot) com -- 140
- freebeacon (dot) com -- 137
- huffpost (dot) com -- 131
- breitbart (dot) com -- 107
- foxnews (dot) com -- 101
- twitchy (dot) com -- 92
- thefederalist (dot) com -- 88
- townhall (dot) com -- 83
- washingtonpost (dot) com -- 82
- dailywire (dot) com -- 71
- pjmedia (dot) com -- 61
- lauraloomer.us -- 52
- nytimes (dot) com -- 42
- infowars (dot) com -- 40
- vdare (dot) com -- 21
- prageru (dot) com -- 19
- reddit (dot) com -- 18
- actblue (dot) com -- 13
The list below highlights the most commonly used third party domains. The list breaks out the domain, the number of times it was called, and the company that owns the domain. As shown below, the top 24 third parties were all called by 18 or more sites.
The top 24 third party sites getting data include some well known names in the general tech world, such as Google, Facebook, Amazon, Adobe, Twitter, and Oracle.
However, lesser known companies are also broadly used, and get access to user information as well. These less known companies collecting information about people's browsing habits include AppNexus, MediaMath, The Trade Desk, OpenX, Quantcast, RapLeaf, Rubicon Project, comScore, and Smart Ad Server.
Top third party domains called:
- doubleclick.net - 25 - Google
- googleapis.com - 24 - Google
- facebook.com - 23 - Facebook
- google.com - 23 - Google
- google-analytics.com - 22 - Google
- googletagservices.com - 22 - Google
- gstatic.com - 22 - Google
- adnxs.com - 21 - AppNexus
- googlesyndication.com - 21 - Google
- adsrvr.org - 20 - The Trade Desk
- mathtag.com - 20 - MediaMath
- twitter.com - 20 - Twitter
- yahoo.com - 20 - Yahoo
- amazon-adsystem.com - 19 - Amazon
- bluekai.com - 19 - Oracle
- facebook.net - 19 - Facebook
- openx.net - 19 - OpenX
- quantserve.com - 19 - Quantcast
- rlcdn.com - 19 - RapLeaf
- rubiconproject.com - 19 - Rubicon Project
- scorecardresearch.com - 19 - comScore
- ampproject.org - 18 - Google
- everesttech.net - 18 - Adobe
- smartadserver.com - 18 - Smart Ad Server (partners with Google and the Trade Desk)
The full list of domains, and the paired third party calls, are available on Github.
As noted above, Doubleclick -- an adtech and analytics service owned by Google -- is used on every single site in this scan. We'll take a look at what that means in practical terms later in this post. But other domains are also used heavily across multiple sites.
amazon-adsystem.com -- controlled by Amazon -- was called on 19 sites in the scan, including Mediaite, CNN, Reddit, Huffington Post, the Washington Post, the NY Times, Western Journal, PJ Media, ZeroHedge, the Federalist, Breitbart, and the Daily Caller.
adsrvr.org -- a domain that appears to be owned by The Trade Desk, was called on 20 sites in the scan, including Breitbart, PJMedia, ZeroHedge, The Federalist, CNN, Mediaite, Huffington Post, and the Washington Post.
Stripe -- a popular payment platform -- was called on right wing sites to outright hate sites. While I did not confirm that each payment gateway is active and functional, the chances are good that Stripe is used to process payments on some or all of the sites where it appears. Sites where calls to Stripe came up in the scan include VDare (a white nationalist site), Laura Loomer, Breitbart, and Gateway Pundit.
Stripe is primarily a payment processor, and is included here to show an additional business model -- selling merchandise -- used to generate revenue. However, multiple adtech and analytics providers are used indiscriminately on sites across the political spectrum. While some people might point to the ubiquity and reuse of adtech across the political spectrum -- and across the spectrum of news sites, from mainstream to highly partisan sites, to hate sites and misinformation sites -- as a sign of "neutrality", it is better understood as an amoral stance.
Adtech helps all of these sites generate revenue, and helps all of these sites understand what content "works" best to generate interaction and page views. When mainstream news sites use the same adtech as sites that peddle misinformation, the readers of mainstream sites have their reading and browsing habits stored and analyzed alongside the browsing habits of people who live on an information diet of misinformation. In this way, when mainstream news sites choose to have reader data exposed to third parties that also cater to misinformation sites, it potentially exposes these readers to advertising designed for misinformation platforms. In the targeted ad economy, one way to avoid being targeted is to be less visible in the data pool, and when mainstream news sites use the same adtech as misinformation sites, they sell us out and increase our visibility to targeted advertisers.
Looking at this from the perspective of an adtech or analytics vendor, they have the most to gain financially from selling to as many customers as possible, regardless of the quality or accuracy of the site. The more data they collect and retain, the more accurate (theoretically) their targeting will become. The ubiquity of adtech used across sites allows adtech vendors to skim profit off the top as they sell ads on web properties working in direct opposition to one another.
In short, while our information ecosystem slowly collapses under the weight of targeted misinformation, adtech profits from all sides, and collects more data from people being misled, thus allowing more accurate targeting of people most susceptible to misleading content over time. Understood this way, adtech has a front row seat to the steady erosion of our information ecosystem, with a couple notable caveats: first, with the dataset adtech has collected and continues to grow, they could identify the most problematic players. Second, adtech profits from lies just as much as truth, so they have a financial incentive to not care.
But don't take my word for it. In January 2017, Randall Rothenberg, the head of the Interactive Advertising Bureau (IAB, the leading trade organization for online adtech), described this issue:
We have discovered that the same paths the curious can trek to satisfy their hunger for knowledge can also be littered deliberately with ripe falsehoods, ready to be plucked by – and to poison – the guileless.
In his 2017 speech, Rothenberg correctly observes that advertising has what he describes as a "civic responsibility":
Our objective isn’t to preserve marketing and advertising. When all information becomes suspect – when it’s not just an ad impression that may be fraudulent, but the data, news, and science that undergird society itself – then we must take civic responsibility for our effect on the world.
In the same speech in 2017, Rothenberg highlights the interdependence of adtech and the people affected by it, and the responsibilities that requires from adtech companies.
First, let me dispense with the fantasy that your obligation to your company stops at the door of your company. For any enterprise that has both customers and suppliers – which is to say, every enterprise – is a part of a supply chain. And in any supply chain, especially one as complex as ours in the digital media industry, everything is interdependent – everything touches something else, which touches someone else, which eventually touches everyone else. No matter how technical your company, no matter how abstruse your particular position and the skill it takes to occupy it, you cannot divorce what you do from its effects on the human beings who lie, inevitably, at the end of this industry’s supply chain.
Based on what is clearly observable in this scan of 25 sites that featured heavily in misinformation campaigns, nearly three years after the head of the IAB called for improvements, actual improvements appear to be in very short supply.
Tracking Across the Web
To illustrate how tracking looks in practice, I did a sample scan across six web sites: Gateway Pundit Breitbart PJ Media Mediaite The Daily Beast The Federalist
While all of these sites use dozens of trackers, for reasons of time we will limit our review to two: Facebook and Google. Also, to be very clear: the proxy logs for this scan of six sites contains an enormous amount of information about what is collected, how it's shared, and the means by which data are collected and synched between companies. The discussion in this post barely scratches the surface, and this is an intentional choice. Going into more detail would have required a deeper dive into the technical implementation of tracking, and while this deeper dive would be fun, it's outside the scope of this post.
In the screenshots below, the urls sent in the headers of the request, the User Agent information, and the full cookie ID are partially obfuscated for privacy reasons.
Facebook sets a cookie on the first site: Gateway Pundit. This cookie has a unique ID, which gets reused across multiple sites. The initial request sent to Facebook includes a timestamp, and basic information about the system used to access the site (details like operating system, browser, browser version, and screen height and width). The request also includes the time of day, and the referring URL.
At this point, Facebook doesn't need much more flesh out a device fingerprint to map to this ID to a specific device. However, a superficial scan of multiple scripts loaded by domains affiliated with Facebook suggest that Facebook collects adequate data to generate a device fingerprint, which would allow them to then tie that more specific identifier to different cookie IDs over time.
The cookie ID is consistently included in headers across multiple web sites. In the screenshot below, the cookie ID is included in a request on Breitbart:
And PJ Media:
And the Daily Beast:
And the Federalist:
Google (or more specifically, Doubleclick, which is owned by Google) works in a similar way as Facebook.
The initial Doubleclick cookie, with a unique value, gets set on the first site, Gateway Pundit. As with Facebook, this cookie is repeatedly included in header requests on every site in this scan.
Here, we see the same ID getting included in the header on PJ Media:
And on Breitbart:
As with Facebook, Google repeatedly gets browsing information, and information about the device doing the browsing. This information is tied to a common identifier across web sites, and this common identifier can be tied to a device fingerprint, which can be used to precisely identify individuals over time. The data collected by Facebook and Google in this scan includes specific URLs accessed, and patterns of activity across the different sites. Collectively, over time, this information provides a reasonably clear picture of a person's habits and interests. If this information is combined with other data sets -- like search history from Google, or group and interaction history from Facebook, we can begin to see how browsing patterns provide an additional facet that can be immensely revealing as part of a larger profile.
Conclusion, or Thoughts on Why this Matters
As has been demonstrated, it's not difficult to identify the location of specific individuals using even rudimentary adtech tools.
Given the opacity of the adtech industry, it can be difficult to detect and punish fraudulent behavior -- such as what happened with comScore, an adtech service used in 19 of the 25 sites scanned.
As social media platforms -- who are also adtech vendors and data brokers -- flail and fail to figure out their role, the ability to both amplify questionable content and to target people using existing adtech services provide powerful opportunities to influence people who might be prone to a nudge. This is the promise of advertising, both political and consumer, and the tools for one are readily adaptable for the other.
Adtech both profits from and extends information asymmetry. The companies that act as data brokers and adtech vendors know far more about us than we do about them. Web sites pushing misinformation -- and the people behind these sites -- can potentially use this stacked deck to underwrite and potentially profit from misinformation.
Adtech in its current form should be understood as a parasite on the news industry. When mainstream news sites throw money and data into the hands of adtech companies that also support their clear enemies, mainstream sites are actively undermining their long term interests.
Conversely, though, the adtech companies that currently profit from the spread of misinformation, and the targeting of those who are most susceptible to it, are sitting on the dataset that could help counter misinformation. The same patterns that are used to target ads and analyze individuals susceptible to those ads could be put to use to better understand -- and dismantle -- the misinformation ecosystem. And the crazy thing, and a thing that could provide hope: all it would take is one reasonably sized company to take this on.
If one company decided that, finally, enough is enough, they could theoretically work with researchers to develop an ethical framework that would allow for a comprehensive analysis of the sites that are central to spreading specific types of misinformation. While companies like Google, Facebook, Amazon, Appnexus, MediaMath, the Trade Desk, comScore, or Twitter have shown no inclination to tackle this systematically, countless smaller companies would potentially have datasets that are more than complete enough to support detailed insights.
Misinformation campaigns are happening now, across multiple platforms, across multiple countries. The reasons driving these campaigns vary, but the tradecraft used in these campaigns has overlaps. While adtech currently supports people spreading misinformation, it doesn't need to be this way. The same data that are used to target individuals could be used to counter misinformation, and make it more difficult to profit from spreading lies.