Update on "Personal Email, School-Required Software, and Ad Tracking"

4 min read

I just re-ran the scan that, earlier this week, found what appeared to be advertising-related tracking in Canvas when a student logged in to the LMS after logging in to a personal Gmail account.

The latest round of tests showed very different behavior: the tracking observed in the earlier tests is no longer present. This change appears to have happened in the roughly 36 hours since I put out my original blog post. The technical details are in my original writeup (linked above), but the short version:

  • In the original scan, after logging into Canvas, two Google-controlled endpoints were connected via redirects: "google.com/ads" and "stats.g.doubleclick.net". Calls to these endpoints appeared to map cookie IDs set for advertising to Canvas's Google Analytics ID.
  • In the original scan, after logging into Canvas, these endpoints were called multiple times (at least three times each over approximately 90 seconds of browsing).
  • In the most recent scan, after logging into Canvas, using a script identical to the original scan, these endpoints are not called at all, and the related cookie IDs are not sent.

Fixed?

Viewed through a privacy lens, the removal of the cookie mapping is a good thing. It's an interesting shift, though, and it raises a few questions and possibilities. I will attempt to list as many of these as possible, including options that are fairly unlikely.

  1. the fix for the issue I flagged in my post was already in the development pipeline and was deployed yesterday right on schedule;
  2. the ID mapping was part of a larger strategic plan and was removed intentionally;
  3. the ID mapping was in place as a result of human error, and this was addressed;
  4. the issue was related to how Google deploys Analytics, and Google made a change on their end completely unrelated to anything I observed;
  5. my original tests reported a bug or some other aberration that was subsequently fixed;
  6. ???

In my opinion -- based both on past experience with issues like this and on a gut feeling (which, for obvious reasons, doesn't mean much) -- the third option (human error) feels most likely.

Regardless of the reason, I would strongly advise Instructure to provide a clear, transparent, and complete breakdown of what exactly happened here. There is a range of plausible and reasonable explanations -- but the students and families whose information is entrusted to Instructure deserve a clear, transparent, and complete explanation.

Taking a step back, this is an issue that goes beyond Instructure. While Instructure had the bad luck to be the vendor included in this scan, we need to look long and hard at the reliance the edtech industry places on Google Analytics.

Analytics data are tracking data, and can easily be repurposed to support profiling and advertising. Google Analytics is increasingly transparent about this, but we shouldn't pretend that analytics from other services can't be used in similar ways. Google describes the relationship very clearly:

When you link your Google Analytics account to your Google Ads account, you can:

  • view Google Ads click and cost data alongside your site engagement data in Google Analytics;
  • create remarketing lists in Analytics to use in Google Ads campaigns;
  • import Analytics goals and transactions into Google Ads as conversions; and
  • view Analytics site engagement data in Google Ads.

The distinctions made between educational data/student data and consumer data are often contrived, and the protections offered over "educational" data are fragile. Instead of thinking about "student data," we would be better off thinking about data that are collected in an educational setting -- and we would be even better off with real privacy protections that protect the rights of individuals regardless of where the data were collected.

Personal Email, School-Required Software, and Ad Tracking

18 min read

UPDATE December 21, 2019: After I put this post out, I re-ran the scan as part of routine follow-up. The cookie mapping that was observed in the original scan and documented in this post is no longer present. It's not clear how or why this shift occurred, but at some point between the original scan and a new scan completed after this writeup was published, the tracking behavior observed within Canvas changed. More details are available here. END UPDATE.

Recently, a friend reached out to me with some questions about ad tracking, and the potential for ad tracking that may or may not occur when a learner is using a Learning Management System (or LMS) provided by a school. LMSs are often required by schools, colleges, and universities. LMSs hold a unique spot in student learning, effectively positioned between students, faculty, and the work both need to do to succeed and progress.

With the central placement of LMSs in mind, we wanted to look at a common use case for students required to use an LMS as part of their daily school experience. In particular, we wanted to look at the potential for third party tracking when students do a range of pretty normal things: check their personal email, search and find information, watch a video, and check an assignment for school. The tasks in this scan take a person about five to seven minutes to complete.

The account used for testing is from a real student above the age of 13 in a K12 setting in the United States. The LMS accessed in the test is Canvas from Instructure, and the LMS is required for use in the school setting. The full testing scenario, additional details on the testing process, and screenshots documenting the results are all available below.

Summary and Overview

The scan described in this post focuses on one question: if a high school student has a personal Gmail account and is required to use a school-provided LMS with a school-provided email, what ad tracking could they be exposed to via regular web browsing?

In this scan, we observed tracking cookies set on a person's browser almost immediately after logging into their consumer Gmail account. These tracking cookies were used to track the person as they searched on Google and YouTube, and as they browsed a popular site focused on providing medical information. Because the account used for the scan is a consumer Gmail account, the observed tracking is not unexpected.

However, when the student logged into Canvas, the LMS provided by their school, using a school-provided email address that is not a GSuite account, we also observed the same ad tracking cookies getting synched to the LMS's Google Analytics tracking ID. This synchronization clearly occurred while the student was logged into the LMS.

This tracking activity raises several questions, but in this summary we will limit the scope to three:

  1. Why is a Google Analytics ID being mapped to tracking cookies that are tied to an individual identity and set in an ad tracking context?
  2. Why is the LMS -- in this example, Canvas -- using Analytics that potentially exposes learners to ad tracking?

These two questions lead into the third question, which will be the subject of follow-up scans: given the large number of educational sites that also use Google Analytics, can similar mapping of Google Analytics IDs to adtech cookie IDs be observed on other educational sites?

The analysis of the scan is broken into multiple sections, and each section has a "Breakpoint" that summarizes its key findings.

  • Testing Scenario: The steps used in this scan to allow anyone to replicate this work.
  • Testing Process: The process used to set up for the scan.
  • Results: The full results of the scan.
  • Breakpoint 1: A summary of the process that sets the tracking cookies after a person logs in to a consumer GMail account.
  • Breakpoint 2: Search activity on Google.
  • Breakpoint 3: Ad tracking on the Mayo Clinic site.
  • Breakpoint 4: Search activity on YouTube.
  • Breakpoint 5: Mapping of Instructure's Google Analytics IDs to ad tracking IDs.
  • Additional Scans: Follow up work indicated by this scan.
  • Conclusions: Takeaways and observations from the scan.

Testing Scenario

The scan was run using a real GMail account, and a real school email account provisioned by a public K12 school district in the United States. The owner of both accounts is over the age of 13. The school email account was not a GSuite EDU account. The LMS used to run this test was Canvas from Instructure. The testing scan used these steps:

A. Consumer Google Account

  1. Log in at google.com
  2. Go to email
  3. Read an email
  4. Return to google.com
  5. Search for "runny nose"

B. Medical Information

  1. View the top hit from Mayo Clinic or WebMD

C. YouTube

  1. Go to YouTube.com
  2. Search for "runny nose"
  3. View the top hit for 90 seconds
  4. Watch one of the top recommended videos for 90 seconds.

D. School-supplied LMS in K12

  1. Go to Canvas login page and log in using a school-provided email address
  2. Navigate course materials (approximately 10 clicks to access assignments and notes)
  3. Return to student dashboard
  4. Log out of Canvas

Testing Process

The testing used a clean browser, with all cookies, cache, browsing history, and offline content deleted prior to beginning the scan. The Gmail account used retained its default settings.

Web traffic was observed using OWASP ZAP, an intercepting proxy.
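For anyone looking to replicate this setup, the sketch below shows one way to route scripted traffic through ZAP so that every request and response lands in the proxy log. This is a minimal illustration, not the exact harness used for this scan, and it assumes ZAP is listening on its default address (localhost:8080); the interactive steps in the testing scenario work the same way, with a clean browser pointed at the proxy.

    # Minimal sketch: send traffic through a locally running OWASP ZAP
    # instance so every request/response appears in the proxy log.
    # Assumes ZAP is listening on its default address, localhost:8080.
    import requests

    ZAP = "http://127.0.0.1:8080"
    proxies = {"http": ZAP, "https": ZAP}

    session = requests.Session()
    # verify=False because ZAP re-signs HTTPS traffic with its own
    # certificate; trusting the ZAP root CA is the cleaner alternative.
    resp = session.get("https://www.google.com/", proxies=proxies, verify=False)
    print(resp.status_code, len(resp.content))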

Results

In summarizing the results, we will focus on tracking related to Google, both across the web and while logged in to Canvas. This analysis does not get into the tracking that Canvas itself does, or the tracking and data access permitted by Canvas via its APIs. For a good analysis of the tracking and access that Canvas allows via its APIs, read Kin Lane's breakdown of the data elements supported by Canvas's public APIs.

This post looks at one specific question: if a person is both browsing the web and using their school-provided LMS, what could tracking look like? The results described here provide a high level summary of the full scan; for reasons of focus and brevity, we only cover observed tracking from Google. Other entities that appear in this scan also get data, but Google gets data throughout the testing script.

In the scan, multiple services set multiple IDs. The analysis in this post highlights two IDs set by Google; these two IDs merit a higher level of attention because they are called across multiple sites, are mapped to one another, and are mapped to a separate Google Analytics ID connected to Canvas. In the scan, mapping Google Analytics IDs to IDs that appear to be connected to ad tech happens on both sites that use Google Analytics - the Mayo Clinic site, and the Canvas site.

To protect the privacy of the account used to run this scan, we obscure the IDs when we show the screenshots. The first ID will be marked by this screen:

Screen for Tracker 1

The second ID is marked by this screen:

Screen for Tracker 2

For privacy reasons, I also obscure the referrer URL and the user-agent string. The referrer URL shows the domain that was scanned, which in turn would expose the specific Canvas instance, which would compromise the privacy of the account used to run the scan. The user-agent string provides technical information about the computer running the scan, including details about the web browser, browser version, and operating system. This information is the foundation of a device fingerprint, which can be used to identify an individual.

Step A. Consumer Google Account

Our scan begins with a person logging in to a personal GMail account.

Almost immediately after logging into Gmail, two tracking cookies are set. These cookies are set sequentially, and are mapped to one another immediately.

A call to "adservice.google.com" sets the first cookie. The response to this initial request both sets a cookie (indicated by the value screened by "Tracker 1") and redirects to a second Google-controlled subdomain (googleads.g.doubleclick.net):

Initial GET request

Screenshot 1

And this is the response that sets the cookie:

Response and set cookie

Screenshot 2

In the response shown above, three things can be observed:

  1. the initial request returns a 302 redirect that calls a new URL;
  2. the location of the new URL is specified in the "Location" header, highlighted in yellow;
  3. the tracker value screened by "Tracker 1" is set via the "Set-Cookie" directive.
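The same redirect-and-set-cookie pattern can be observed with a few lines of code. The sketch below is illustrative only -- the domain and path are placeholders standing in for an ad-service endpoint, not the exact URLs from the scan:

    # Sketch: observe a 302-based cookie sync by disabling automatic
    # redirect following and printing the relevant response headers.
    import requests

    resp = requests.get("https://adservice.example.com/adsid/integrator.js",
                        allow_redirects=False)
    print(resp.status_code)                # expect 302
    print(resp.headers.get("Location"))    # the partner URL called next
    print(resp.headers.get("Set-Cookie"))  # the first tracking cookie

Following the Location header by hand, rather than letting the client follow it automatically, makes it easy to see exactly which response sets which cookie.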

The next event tracked in the scan is the GET request to the URL (in the googleads.g.doubleclick.net subdomain) indicated in Screenshot 2.

Get request for Doubleclick

Screenshot 3

The screenshot below shows the response, including the directive to set the second tracking cookie (marked as "Set-Cookie").

Set Doubleclick cookie

Screenshot 4

At this point in the scan, the two cookies (marked by the "Tracker 1" and "Tracker 2" screens) that will be called repeatedly across all sites visited have been set. As shown in the screenshots, these cookies are mapped to one another from the outset. These two cookies are set after a person logs into a GMail account, so they can be tied to a person's actual identity.

As we will observe in this scan, these cookies are accessed repeatedly across multiple web sites, and connected to a range of different activities and behaviors.


Breakpoint 1: Two tracking cookies have been set. The specific responses that set the cookies are shown in Screenshots 2 and 4. The cookies are named "ANID" and "IDE", and it's important to note that they are almost certainly synchronized with one another via the 302 redirect used to set both values sequentially. When the first cookie value is set, the response header specifies the exact call that sets the second cookie value. In practical terms, this means that Google and Doubleclick both "know" that Tracker 1 and Tracker 2 correspond to the same person. Moreover, because these cookies are set after a person logs into their personal Gmail account, these values are directly tied to a person's identity.

Google provides some partial documentation on the cookies they set and access:

We also use one or more cookies for advertising we serve across the web. One of the main advertising cookies on non-Google sites is named ‘IDE’ and is stored in browsers under the domain doubleclick.net. Another is stored in google.com and is called ANID

As shown above in Screenshot 2, the ANID value (marked by Tracker 1) is accessible from within .google.com. As shown above in Screenshot 4, the IDE value (marked by Tracker 2) is accessible from within .doubleclick.net.


Search on Google

After reading the email, we returned to google.com to do a search for "runny nose." After all, it is the season for colds.

One thing to note for any search functionality that returns suggestions while you type: this functionality doubles as a key logger. For example, when searching for "runny nose" we can observe every keystroke being sent to Google in real time.

Search autocomplete

Screenshot 5

As shown in the above screenshot, every keystroke entered while searching is tied to the first tracking cookie documented in our scan. The text entered in the search box is highlighted in yellow, and we can observe each new keystroke being sent to Google, with the get request mapped to the cookie ID set in Screenshot 2.
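To make the mechanics concrete, the sketch below simulates what an autocomplete widget does on the wire: one request per keystroke, each carrying the session's cookies. The endpoint and parameter names are placeholders, not Google's actual suggest API:

    # Sketch: search-as-you-type generates one network request per
    # keystroke, so the suggest endpoint sees the query being typed
    # in real time, tied to whatever cookies the session carries.
    import requests

    session = requests.Session()  # cookies persist across requests
    query = "runny nose"
    for i in range(1, len(query) + 1):
        partial = query[:i]
        session.get("https://suggest.example.com/complete",
                    params={"q": partial})
        print("sent partial query:", repr(partial))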


Breakpoint 2: Search activity on google.com is (obviously) managed by Google. The full search activity, including individual keystrokes, is tracked and tied to Tracker 1.


Step B. Medical Information

The search for information about a runny nose leads to a page on the Mayo Clinic web site. Visiting this page kicks off some additional tracking and advertising-related behavior.

First, we see the Google Analytics ID for the Mayo Clinic site mapped to the second tracking cookie ID. The Google Analytics ID for the Mayo Clinic site, along with the referrer URL, are both highlighted in yellow.

Mayo Clinic Analytics mapping

Screenshot 6

Then, we can observe what appears to be additional adtech and tracking-related behavior connected to this same tracking cookie ID:

Mayo Clinic ad tracking behavior

Screenshot 7

As we can see in the above screenshot, the referrer URL is from the specific page on the Mayo Clinic web site. As noted above, the cookie IDs are mapped to a specific identity known to Google. Thus, Google knows when the account used for this scan searched for a specific piece of medical information and accessed a web site about it. Because these tracking cookies were set when a person logged into Gmail, this activity can be directly tied to a specific person.


Breakpoint 3: when a person moves off a Google property, the tracking switches to Tracker 2, which can be read by Doubleclick. Screenshot 6 shows Tracker 2 being mapped to the Google Analytics ID of Mayo Clinic. Screenshot 7 shows additional ad-related behavior connected to Tracker 2. In this section, we can observe two additional subdomains: stats.g.doubleclick.net (often connected to Analytics) and ad.doubleclick.net (generally connected to ads). It is not clear why the Tracker 2 value, which was clearly set in an advertising/tracking context, needs to be mapped to a Google Analytics ID.


Step C. YouTube

After visiting the Mayo Clinic web site, the scan continued on YouTube. Here, we searched for a video about a "runny nose" and watched the video.

As noted above when searching using Google, YouTube search also functions as a key logger, and ties the results to a cookie ID that is directly connected to a person's real identity.

Mapping cookies in YouTube

Screenshot 9

Screenshot 9 shows the "ru" of the eventual search query "runny nose". As shown in Screenshot 5 related to searching on Google, a request is sent for every keystroke, including spaces and deletions.


Breakpoint 4: Search activity within YouTube is managed by Google. As with search on google.com, the full search activity, including individual keystrokes, is tracked and tied to Tracker 1.


Step D. School-supplied LMS

After searching for and watching a video about a runny nose, the scan proceeded to log in to a K12 instance of Canvas.

For this scan, the person logged into the LMS with a school-provided email account. The school-provided email account was not provisioned from a GSuite for EDU domain. The email address was from a K12 school district domain and was connected to a student account.

After the person logs into Canvas, both cookie IDs are mapped to Instructure's Google Analytics ID. The mapping occurs via 302 redirects, with the Analytics ID contained in URL calls that include the Cookie IDs in the request headers. The process is documented in the screenshots below, and is similar to the mapping that occurred while browsing the Mayo Clinic web site.

The referring URL is clearly a course within the LMS. The Google Analytics ID (UA-9138420) that belongs to Canvas/Instructure is highlighted in yellow.

The first call is to stats.g.doubleclick.net. As you can see in the screenshot below, the request includes the Google Analytics ID and the tracking cookie in the request header. The response returns a redirect that also includes the Google Analytics ID.

First call to map trackers in Canvas

Screenshot 10

As shown in Screenshot 10, the URL specified by the redirect points to google.com/ads. The redirect also contains the Google Analytics ID for Instructure.

Mapping trackers in Canvas

Screenshot 11

As described and shown in Screenshots 10 and 11, these two calls map both cookie IDs to Instructure's Google Analytics ID. To emphasize, both of the cookie IDs mapped to Instructure's Google Analytics ID are also directly connected to a personal GMail account that is tied to a person's identity.


Breakpoint 5: While logged into a school-provided (and required) LMS, both Tracker 1 and Tracker 2 are mapped to the Google Analytics ID of the LMS. This means that the same advertising IDs that are tied to a specific student's identity, tied to browsing history on a site with medical information, and tied to search history on Google and YouTube, are also tied to the Google Analytics ID of an EdTech vendor. In practical terms, this means that Google could theoretically incorporate general LMS usage data (time on site, time on page, pages visited, etc) into their profiles of learners and/or educators.
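Anyone who wants to check for this kind of mapping in their own traffic can export a HAR file from an intercepting proxy and scan it. The sketch below assumes a standard HAR 1.2 export (the file name is hypothetical) and flags requests that carry both a Google Analytics ID in the URL and one of the tracking cookies documented above:

    # Sketch: flag HAR entries where a Google Analytics ID appears in
    # the request URL while an IDE or ANID cookie rides along in the
    # request headers -- the mapping pattern described in Breakpoint 5.
    import json
    import re

    GA_ID = re.compile(r"UA-\d+(?:-\d+)?")

    with open("scan.har") as f:  # hypothetical export file name
        entries = json.load(f)["log"]["entries"]

    for entry in entries:
        url = entry["request"]["url"]
        cookies = entry["request"].get("cookies", [])
        match = GA_ID.search(url)
        if match and any(c["name"] in ("IDE", "ANID") for c in cookies):
            print(match.group(0), "->", url.split("?")[0])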


Visiting Subdomains

Visiting the URLs called when the cookies were mapped to Instructure's Analytics ID returns web sites that appear to serve advertisers.

Attempting to visit google.com/ads redirects to a page that clearly appears to be connected to advertising:

Google Ads web page

Screenshot 12

Attempting to visit stats.g.doubleclick.net redirects to a page that offers services for analytics related to Google Marketing Platform.

Google Marketing

Screenshot 13

A look at the features overview page shows that there is a "native data onboarding integration" with Google Ads and AdSense, and "native remarketing integrations" with Google Ads.

Google Analytics integration

Screenshot 14

Additional Areas for Examination

This initial scan was limited in scope to test one specific -- yet common -- use case: what does ad tracking look like when a person has a consumer Gmail account, and uses the same browser to access both that personal account and their school-provided LMS? With this initial scan in place, several follow-up tests would help create a more complete picture.

  • Use a school-provided Gmail account.
  • Visit other sites with ads and observe other ad-related interactions that are mapped to either of these cookies.
  • Test other LMSs that use Google Analytics to see if there is comparable mapping of Google Analytics IDs to cookie IDs.
  • Test other educational sites that use Google Analytics to see if there is comparable mapping of Google Analytics IDs to cookie IDs.

These scans would each provide additional information that would help create a more complete picture, and would build on and provide additional context to what was observed in this initial scan. If the mapping observed in this scan is replicated across the web on other educational sites that use Google Analytics within K12 or higher ed, then -- theoretically -- students could be profiled based on their interactions with sites they are required to use for school. The types of redlining, targeting, or "predictions" that would be possible from this type of profiling are clearly not in the best interests of learners.

Conclusions

This scan covers a pretty common use case: a person who checks their personal email, searches for other information, and then does some schoolwork. As documented in this writeup, this activity results in a range of tracking that includes:

  • a. tracking cookies are set shortly after a person logs into a Google account, and these cookies are directly tied to a person's specific identity;
  • b. via these cookies, Google gets specific information about searches on YouTube and Google, including keylogging of the search process;
  • c. via these cookies, Google gets specific information about the sites a person visits, and when they visit them;
  • d. on both sites in this scan that used Google Analytics, the domain's Google Analytics ID was synched with tracking cookies;
  • e. while a public high school student is logged in to their required LMS, the LMS's Google Analytics ID is mapped to cookie IDs that appear to be used for ad targeting, and are tied to the student's real identity.

It is not clear why Instructure's Google Analytics ID needs to be mapped to cookie IDs that are set in a consumer context and appear to be related to ad tracking.

To be very clear: the mapping of tracking cookies to a person's actual identity occurred within the context of consumer use. When a person uses Gmail, or searches via Google, or browses a site for medical information, they are tracked, and they are tracked in ways that can be connected back to their real identity. This is how adtech works, and -- based on current privacy law in the United States -- this is completely legal.

As observed in this scan, the tracking cookies set in a consumer context are also accessed when a student is logged into their LMS, in a strictly educational context. In practical terms, the only way for a high school student to completely avoid the type of tracking documented in this scan would be to practice abnormally strong browser hygiene -- for example, they could set up a separate profile in Firefox that they only used while accessing the LMS. But realistically, the chances of that happening are slim to none, and "solutions" like this put the onus in the wrong place: a high school student should not be required to fix the excesses of the adtech industry, especially when they are accessing the required software that comes as a part of their legally required public education.

Dark Patterns and Passive Aggressive Disclaimers - It's CCPA Season!

4 min read

In today's notes on CCPA compliance, Dashlane gets the award for passive aggressive whinging paired with a dark pattern designed to obscure consent. I have managed to get my hands on secret video of Dashlane's team while they were planning how to structure their opt out page. This completely legitimate video is included below.

Hidden camera video of Dashlane team
Hidden camera video of the design process for Dashlane's opt out page.

In case you've never heard of Dashlane, they are a password manager. Three alternatives that are all less whingy are 1Password, LastPass, and KeePassXC -- and KeePassXC is an open source option.

Dashlane appears to be preparing for California's privacy law, CCPA, which is set to go into effect in 2020. 

The screenshot below is from Dashlane's splash page where, under CCPA, they are required to allow California residents to opt out of having their data sold. CCPA has a reasonably broad definition of what selling data means, and, predictably, some companies are upset at having any limits placed on their ability to use the data they have collected or accumulated.

Full page screenshot

Dashlane's disclaimer and opt out page provides a good example of how a company can comply, yet exhibit bad faith in the process.

First, let's look at their description of sales as defined by CCPA:

However, the California Consumer Privacy Act (“CCPA”), defines “sale” very broadly, and it likely includes transfers of information related to advertising cookies.

Two thoughts come to mind nearly simultaneously: this is cute, and stop whining. Companies have used a range of jargon to define commercial transfers of data for years - for example, "sharing" with "affiliates", or custom definitions of what constitutes PII, or shell games with cookies that are mapped between vendors and/or mapped to a browser or device profile. It's also worth noting that Dashlane is theoretically a company that helps people maintain better privacy and security practices via centralized password management. It's hard to imagine a better example of a company that should look to exceed the basic ground-level requirements of privacy laws. Instead, Dashlane appears to be whinging about it.

However, Dashlane does more than just whine about CCPA. They take the extra step of burying their opt out in a multilayered dark pattern, complete with unclear "help" text and labels.

Dark pattern

As shown in the above screenshot, Dashlane's text instructs people to make a selection in "the box below". However, two obvious problems immediately become clear. First, there is no box, below or otherwise - the splash page contains a toggle and a submit button.

Second, assuming that the toggle is what they mean by "box", we have two options: "active" or "inactive." It's not clear which option turns cookies "off" - does the "active" setting mean that we have activated enhanced privacy protections, or does it mean that ad tracking is activated? This is a pretty clear example of a dark pattern, or a design pattern that intentionally misleads or confuses end users.

Based on additional language on the splash page, it looks like the confusion that Dashlane has created matters less than it might, because anything we set on this page appears to be easy to wipe out, either intentionally or accidentally. So, even if the user makes the wrong choice because the language is intentionally confusing, this vague choice can get erased pretty easily.

Brittle settings

Based on this description, the ad tracking opt-out sounds like it's cookie-based, and therefore brittle to the point of meaninglessness.

While it remains to be seen how other companies will address their obligations under CCPA, I'd like to congratulate Dashlane on taking an early lead in the "toothless compliance" and "aggressive whinging" categories.

The Data Aren't Worth Anything But We'll Keep Them Forever Anyways. You're Welcome.

4 min read

Earlier this week, Instructure announced that they were being acquired by a private equity firm for nearly 2 billion dollars. 

Because Instructure offers a range of services, including a learning management system, this triggered the inevitable conversation: how much of the 2 billion price tag represented the value of the data?

The drone is private equity.

There are multiple good threads on Twitter that cover some of these details, so I won't rehash these conversations - the timelines of Laura Gibbs, Ian Linkletter, Matt Crosslin, and Kate Bowles all have some interesting commentary on the acquisition and its implications. I recommend reading their perspectives.

My one addition to the conversation is relevant both to Instructure and educational data in general. Invariably, when people raise valid privacy concerns, defenders of what currently passes as acceptable data use say that people raising privacy concerns are placing too much emphasis on the value of the data, because the data aren't worth very much.

Before we go much further, we also need to understand what we mean when we say data in this context: data are the learning experiences of students and educators; the artifacts that they have created through their effort that track and document a range of interactions and intellectual growth. "Data" in this context are personal, emotional, and intellectual effort -- and for everyone who had to use an Instructure product, their personal, emotional, and intellectual effort have become an asset that is about to be acquired by a private equity firm.

But, to return to the claims that the data have no real value: these claims about the lack of value of the underlying data are often accompanied by long descriptions of how companies function, and even longer descriptions about where the "real" value resides (hint: in these versions, it's never the data).

Here is precisely where these arguments fall apart: if the data aren't worth anything, why do companies refuse to delete them?

We can get a clear sense of the worth of the data that companies hold by looking at the lengths they go to in order to obfuscate their use of this data, and the lengths they go to in order to hold on to it. We can see a clear example of what obfuscation looks like in this post on the Instructure blog from July of 2019. The post includes this lengthy non-answer about why Canvas doesn't support basic user agency in the form of an opt out:

What can I say to people at my institution who are asking for an "opt-out" for use of their data?

When it comes to user-generated Canvas data, we talk about the fact that there are multiple data stewards who are accountable to their mission, their role, and those they serve. Students and faculty have a trust relationship with their educational institutions, and institutions rely on data in order to deliver on the promise of higher education. Similarly, Instructure is committed to being a good partner in the advancement of education, which means ensuring our client institutions are empowered to use data appropriately. Institutions who have access to data about individuals are responsible to not misuse, sell, or lose the data. As an agent of the institution, we hold ourselves to that same standard.

Related to this conversation, when we hear companies talking about developing artificial intelligence (AI) or machine learning (ML) to develop or improve their product, they are describing a process that requires significant amounts of data to start the process, and significant amounts of new/additional data to continue to develop the product.

But for all the companies, and the paid and unpaid defenders of these companies: you claim that the data have no value while simultaneously refusing to delete the data -- or to even allow a level of visibility into or learner control over how their data are used.

If -- as you claim -- the data have no value, then delete them.

Adtech, Tracking, and Misinformation: It's Still Messy

15 min read

Introduction

Over the last several months, I have wasted countless hours reading through and collecting online posts related to several conversational spikes that were triggered by current events. These conversational spikes contained multiple examples of outright misinformation and artificial amplification of this misinformation.

I published three writeups describing this analysis: one on a series of four spikes related to Ilhan Omar, a second related to the suicide of Jeffrey Epstein, and a third related to trolls and sockpuppets active in the conversation related to Tulsi Gabbard. For these analyses, I looked at approximately 2.7 million tweets, including the domains and YouTube videos shared.

Throughout each of these spikes, right-leaning and far-right web sites that specialize in false or misleading information were shared far more extensively than mainstream news sources. As shown in the writeups, there was nothing remotely close to balance in the sites shared. Right-wing sites and sources dominated the conversation, both in number of shares and in number of domains shared.

This imbalance led me to return to a question I looked at back in 2017: is there a business model or a sustainability model for publishing misinformation and/or hate? This is a question multiple other people have asked; as one example, Buzzfeed has been on this beat for years now.

To begin to answer this question, I scanned a subset of the sites used when spreading or amplifying misinformation, along with several mainstream media sites. This scan had two immediate goals:

  • get accurate information about the types of tracking and advertising technology used on each individual site; and 
  • observe overlaps in tracking technologies used across multiple sites.

Both mainstream news sites and misinformation sites rely on advertising to generate revenue.

The companies that sell ads collect information about people, the devices they use, and their geographic location (at minimum, inferred from IP addresses, but also captured via tracking scripts), as part of how they sell and deliver ads.

This scan will help us answer several questions:

  1. what companies help these web sites generate revenue?
  2. what do these adtech companies know about us?
  3. given what these companies know about us, how does that impact their potential complicity in spreading, supporting, or profiting from misinformation?

Methodology

Twenty-five sites were scanned -- each site is listed below, followed by the number of third parties that were called on each site. The sites selected for scanning meet one or more of the following criteria: were used to amplify false or misleading narratives on social media; have a track record of posting false or misleading content; are recognized as a mainstream news site; are recognized as a partisan but legitimate web site.

Every site scan began by visiting the home page. From the home page, I followed a linked article. From the linked article, I followed a link to another article within the site, for a total of three pages in each site.

On each pageload, I allowed any banner ads to load, and then scrolled to the bottom of the page. A small number of the sites used "infinite scroll" - on these sites, I would scroll down the equivalent of approximately 3-4 screens before moving on to a new page in the site.

While visiting each site, I used OWASP ZAP (an intercepting proxy) to capture the web traffic and any third party calls. For each scan, I used a fresh browser with the browsing history, cookies, and offline files wiped clean.

Summary Results

The sites scanned are listed below, sorted by the number of observed trackers, from high to low.

The sites at the top of the list shared information about site visitors with more third-party domains. In general, each individual domain is a different company, although in some cases (like Google and Facebook) a single company can control multiple domains. This count is at the domain level: if a site sent user information to subdomain1.foo.com and subdomain2.foo.com, the two distinct subdomains count as a single third party.

  • dailycaller (dot) com -- 189
  • thegatewaypundit (dot) com -- 160
  • thedailybeast (dot) com -- 154
  • mediaite (dot) com -- 153
  • dailymail.co.uk -- 151
  • zerohedge (dot) com -- 145
  • cnn (dot) com -- 143
  • westernjournal (dot) com -- 140
  • freebeacon (dot) com -- 137
  • huffpost (dot) com -- 131
  • breitbart (dot) com -- 107
  • foxnews (dot) com -- 101
  • twitchy (dot) com -- 92
  • thefederalist (dot) com -- 88
  • townhall (dot) com -- 83
  • washingtonpost (dot) com -- 82
  • dailywire (dot) com -- 71
  • pjmedia (dot) com -- 61
  • lauraloomer.us -- 52
  • nytimes (dot) com -- 42
  • infowars (dot) com -- 40
  • vdare (dot) com -- 21
  • prageru (dot) com -- 19
  • reddit (dot) com -- 18
  • actblue (dot) com -- 13
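The domain-level rollup described above can be reproduced with a short script. The sketch below uses a naive "last two labels" heuristic to find the registrable domain; a real analysis should use the Public Suffix List (for example, via the tldextract package) so that domains like dailymail.co.uk are handled correctly.

    # Sketch: collapse third-party hosts to registrable domains before
    # counting, so subdomain1.foo.com and subdomain2.foo.com count once.
    from collections import Counter
    from urllib.parse import urlparse

    def registrable(url):
        host = urlparse(url).hostname or ""
        return ".".join(host.split(".")[-2:])  # naive; see caveat above

    calls = [
        "https://subdomain1.foo.com/pixel",
        "https://subdomain2.foo.com/sync",
        "https://stats.g.doubleclick.net/r/collect",
    ]
    print(Counter(registrable(u) for u in calls))
    # Counter({'foo.com': 2, 'doubleclick.net': 1})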

The list below highlights the most commonly used third party domains. The list breaks out the domain, the number of times it was called, and the company that owns the domain. As shown below, the top 24 third parties were all called by 18 or more sites.

The top 24 third party sites getting data include some well known names in the general tech world, such as Google, Facebook, Amazon, Adobe, Twitter, and Oracle.

However, lesser-known companies are also broadly used, and get access to user information as well. These lesser-known companies collecting information about people's browsing habits include AppNexus, MediaMath, The Trade Desk, OpenX, Quantcast, RapLeaf, Rubicon Project, comScore, and Smart Ad Server.

Top third party domains called:

  • doubleclick.net - 25 - Google
  • googleapis.com - 24 - Google
  • facebook.com - 23 - Facebook
  • google.com - 23 - Google
  • google-analytics.com - 22 - Google
  • googletagservices.com - 22 - Google
  • gstatic.com - 22 - Google
  • adnxs.com - 21 - AppNexus
  • googlesyndication.com - 21 - Google
  • adsrvr.org - 20 - The Trade Desk
  • mathtag.com - 20 - MediaMath
  • twitter.com - 20 - Twitter
  • yahoo.com - 20 - Yahoo
  • amazon-adsystem.com - 19 - Amazon
  • bluekai.com - 19 - Oracle
  • facebook.net - 19 - Facebook
  • openx.net - 19 - OpenX
  • quantserve.com - 19 - Quantcast
  • rlcdn.com - 19 - RapLeaf
  • rubiconproject.com - 19 - Rubicon Project
  • scorecardresearch.com - 19 - comScore
  • ampproject.org - 18 - Google
  • everesttech.net - 18 - Adobe
  • smartadserver.com - 18 - Smart Ad Server (partners with Google and the Trade Desk)

The full list of domains, and the paired third party calls, are available on Github.

As noted above, Doubleclick -- an adtech and analytics service owned by Google -- is used on every single site in this scan. We'll take a look at what that means in practical terms later in this post. But other domains are also used heavily across multiple sites.

amazon-adsystem.com -- controlled by Amazon -- was called on 19 sites in the scan, including Mediaite, CNN, Reddit, Huffington Post, the Washington Post, the NY Times, Western Journal, PJ Media, ZeroHedge, the Federalist, Breitbart, and the Daily Caller.

adsrvr.org -- a domain that appears to be owned by The Trade Desk -- was called on 20 sites in the scan, including Breitbart, PJMedia, ZeroHedge, The Federalist, CNN, Mediaite, Huffington Post, and the Washington Post.

Stripe -- a popular payment platform -- was called on sites ranging from right-wing sites to outright hate sites. While I did not confirm that each payment gateway is active and functional, the chances are good that Stripe is used to process payments on some or all of the sites where it appears. Sites where calls to Stripe came up in the scan include VDare (a white nationalist site), Laura Loomer, Breitbart, and Gateway Pundit.

Stripe is primarily a payment processor, and is included here to show an additional business model -- selling merchandise -- used to generate revenue. However, multiple adtech and analytics providers are used indiscriminately on sites across the political spectrum. While some people might point to the ubiquity and reuse of adtech across the political spectrum -- and across the spectrum of news sites, from mainstream to highly partisan sites, to hate sites and misinformation sites -- as a sign of "neutrality", it is better understood as an amoral stance.

Adtech helps all of these sites generate revenue, and helps all of these sites understand what content "works" best to generate interaction and page views. When mainstream news sites use the same adtech as sites that peddle misinformation, the readers of mainstream sites have their reading and browsing habits stored and analyzed alongside the browsing habits of people who live on an information diet of misinformation. In this way, when mainstream news sites choose to have reader data exposed to third parties that also cater to misinformation sites, it potentially exposes these readers to advertising designed for misinformation platforms. In the targeted ad economy, one way to avoid being targeted is to be less visible in the data pool, and when mainstream news sites use the same adtech as misinformation sites, they sell us out and increase our visibility to targeted advertisers.

Note: Ad blockers are great. Scriptsafe, uBlock Origin, and/or Privacy Badger are all good options.

Looking at this from the perspective of an adtech or analytics vendor, they have the most to gain financially from selling to as many customers as possible, regardless of the quality or accuracy of the site. The more data they collect and retain, the more accurate (theoretically) their targeting will become. The ubiquity of adtech used across sites allows adtech vendors to skim profit off the top as they sell ads on web properties working in direct opposition to one another.

In short, while our information ecosystem slowly collapses under the weight of targeted misinformation, adtech profits from all sides, and collects more data from people being misled, thus allowing more accurate targeting of people most susceptible to misleading content over time. Understood this way, adtech has a front row seat to the steady erosion of our information ecosystem, with a couple notable caveats: first, with the dataset adtech has collected and continues to grow, they could identify the most problematic players. Second, adtech profits from lies just as much as truth, so they have a financial incentive to not care.

But don't take my word for it. In January 2017, Randall Rothenberg, the head of the Interactive Advertising Bureau (IAB, the leading trade organization for online adtech), described this issue:

We have discovered that the same paths the curious can trek to satisfy their hunger for knowledge can also be littered deliberately with ripe falsehoods, ready to be plucked by – and to poison – the guileless.

In his 2017 speech, Rothenberg correctly observes that advertising has what he describes as a "civic responsibility":

Our objective isn’t to preserve marketing and advertising. When all information becomes suspect – when it’s not just an ad impression that may be fraudulent, but the data, news, and science that undergird society itself – then we must take civic responsibility for our effect on the world.

In the same speech in 2017, Rothenberg highlights the interdependence of adtech and the people affected by it, and the responsibilities that requires from adtech companies.

First, let me dispense with the fantasy that your obligation to your company stops at the door of your company. For any enterprise that has both customers and suppliers – which is to say, every enterprise – is a part of a supply chain. And in any supply chain, especially one as complex as ours in the digital media industry, everything is interdependent – everything touches something else, which touches someone else, which eventually touches everyone else. No matter how technical your company, no matter how abstruse your particular position and the skill it takes to occupy it, you cannot divorce what you do from its effects on the human beings who lie, inevitably, at the end of this industry’s supply chain.

Based on what is clearly observable in this scan of 25 sites that featured heavily in misinformation campaigns, nearly three years after the head of the IAB called for improvements, actual improvements appear to be in very short supply.

Tracking Across the Web

To illustrate how tracking looks in practice, I did a sample scan across six web sites: Gateway Pundit, Breitbart, PJ Media, Mediaite, The Daily Beast, and The Federalist.

While all of these sites use dozens of trackers, for reasons of time we will limit our review to two: Facebook and Google. Also, to be very clear: the proxy logs for this scan of six sites contains an enormous amount of information about what is collected, how it's shared, and the means by which data are collected and synched between companies. The discussion in this post barely scratches the surface, and this is an intentional choice. Going into more detail would have required a deeper dive into the technical implementation of tracking, and while this deeper dive would be fun, it's outside the scope of this post.

In the screenshots below, the URLs sent in the headers of the request, the User Agent information, and the full cookie ID are partially obfuscated for privacy reasons.

Facebook:

Facebook sets a cookie on the first site: Gateway Pundit. This cookie has a unique ID, which gets reused across multiple sites. The initial request sent to Facebook includes a timestamp and basic information about the system used to access the site (details like operating system, browser, browser version, and screen height and width), along with the referring URL.

Gateway Pundit and Facebook tracking ID

At this point, Facebook doesn't need much more to flesh out a device fingerprint and map this ID to a specific device. Moreover, a superficial scan of multiple scripts loaded by domains affiliated with Facebook suggests that Facebook collects adequate data to generate a device fingerprint, which would allow them to tie that more precise identifier to different cookie IDs over time.

The cookie ID is consistently included in headers across multiple web sites. In the screenshot below, the cookie ID is included in a request on Breitbart:

Breitbart and Facebook tracking ID

And PJ Media:

PJ Media and Facebook tracking ID

And Mediaite:

Mediaite and Facebook tracking ID

And the Daily Beast:

Daily Beast and Facebook tracking ID

And the Federalist:

Federalist and Facebook tracking ID

Google:

Google (or more specifically, Doubleclick, which is owned by Google) works in much the same way as Facebook.

The initial Doubleclick cookie, with a unique value, gets set on the first site, Gateway Pundit. As with Facebook, this cookie is repeatedly included in header requests on every site in this scan.

Gateway Pundit and Google tracking ID

Here, we see the same ID getting included in the header on PJ Media:

PJ Media and Google tracking ID

And on Breitbart:

Breitbart and Google cookie ID

As with Facebook, Google repeatedly gets browsing information, and information about the device doing the browsing. This information is tied to a common identifier across web sites, and this common identifier can be tied to a device fingerprint, which can be used to precisely identify individuals over time. The data collected by Facebook and Google in this scan include specific URLs accessed, and patterns of activity across the different sites. Collectively, over time, this information provides a reasonably clear picture of a person's habits and interests. If this information is combined with other data sets -- like search history from Google, or group and interaction history from Facebook -- we can begin to see how browsing patterns provide an additional facet that can be immensely revealing as part of a larger profile.
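As a rough illustration of why a device fingerprint outlives any single cookie, the sketch below hashes a handful of stable browser attributes into one identifier. Real fingerprinting scripts use far more signals (fonts, canvas, audio, plugins); the attribute values here are placeholders.

    # Naive sketch of device fingerprinting: hash stable browser/system
    # attributes into a single identifier that survives cookie resets.
    import hashlib

    attributes = {
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64) ...",  # placeholder
        "screen": "1920x1080x24",
        "timezone": "America/New_York",
        "language": "en-US",
    }
    canonical = "|".join(f"{k}={v}" for k, v in sorted(attributes.items()))
    fingerprint = hashlib.sha256(canonical.encode()).hexdigest()
    print(fingerprint[:16])  # same device, same value -- cookies or not

Tie an identifier like this to each new cookie ID as it appears, and a profile can persist even when a person clears their cookies.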

Conclusion, or Thoughts on Why this Matters

Political campaigns are becoming increasingly aggressive in how they track people and target them for outreach.

As has been demonstrated, it's not difficult to identify the location of specific individuals using even rudimentary adtech tools.

Given the opacity of the adtech industry, it can be difficult to detect and punish fraudulent behavior -- such as what happened with comScore, an adtech service used in 19 of the 25 sites scanned.

As social media platforms -- who are also adtech vendors and data brokers -- flail and fail to figure out their role, the ability to both amplify questionable content and to target people using existing adtech services provide powerful opportunities to influence people who might be prone to a nudge. This is the promise of advertising, both political and consumer, and the tools for one are readily adaptable for the other.

Adtech both profits from and extends information asymmetry. The companies that act as data brokers and adtech vendors know far more about us than we do about them. Web sites pushing misinformation -- and the people behind these sites -- can potentially use this stacked deck to underwrite and potentially profit from misinformation.

Adtech in its current form should be understood as a parasite on the news industry. When mainstream news sites throw money and data into the hands of adtech companies that also support their clear enemies, mainstream sites are actively undermining their long term interests.

Conversely, though, the adtech companies that currently profit from the spread of misinformation, and the targeting of those who are most susceptible to it, are sitting on the dataset that could help counter misinformation. The same patterns that are used to target ads and analyze individuals susceptible to those ads could be put to use to better understand -- and dismantle -- the misinformation ecosystem. And here is the crazy thing, the thing that could provide hope: all it would take is one reasonably sized company to take this on.

If one company decided that, finally, enough is enough, they could theoretically work with researchers to develop an ethical framework that would allow for a comprehensive analysis of the sites that are central to spreading specific types of misinformation. While companies like Google, Facebook, Amazon, Appnexus, MediaMath, the Trade Desk, comScore, or Twitter have shown no inclination to tackle this systematically, countless smaller companies would potentially have datasets that are more than complete enough to support detailed insights.

Misinformation campaigns are happening now, across multiple platforms, across multiple countries. The reasons driving these campaigns vary, but the tradecraft used in these campaigns has overlaps. While adtech currently supports people spreading misinformation, it doesn't need to be this way. The same data that are used to target individuals could be used to counter misinformation, and make it more difficult to profit from spreading lies.

Researching Political Ads -- A Process, and an Example

13 min read

It might seem like the 2020 elections are a long way away (and in any sane democracy, they would be), but here in the US, we have a solid fourteen months of campaigning ahead of us. This means that we can look forward to fourteen months of attack ads, spurious claims, questionable information -- all of it amplified and spread via Facebook, YouTube, Instagram, Reddit, Snapchat, Telegram, Pinterest, and Twitter, to name a few.

In this post, I break down some steps that anyone can use to uncover how political ads or videos get created by looking at the organizations behind the ad.

The short version of this process:

  • Find organizations
  • Find people
  • Find addresses

Then, look for repetition and overlaps. Later in this post, I'll go into more detail about what these overlaps can potentially indicate.

Like all content I share on this blog, this post is released under a Creative Commons Non-Commercial Attribution Share-Alike license. Please use and adapt the process described here; if you use it in any derivative work please link back to this page.

1. Steps

1. Find out who is running the ad. This can be done in multiple ways, including identifying information at the end of the ad, or finding social media posts or a YouTube channel sharing the ad. If an ad or a piece of content cannot immediately be sourced, that's a sign that the content might not be reliable. It's worth highlighting, though, that the clear and obvious presence of a source doesn't mean the content is reliable -- it just means that we know who created it.

2. Visit the web site of the organization or PAC running the ad, if they have one. Look for names of people involved in the project, and/or other partner orgs. While the lack of clear and obvious disclosure of the organization/people behind a site is a reason for concern, disclosure does not mean that the source is reliable or to be trusted. The organizational affiliation, or the people behind the organization, should be understood as sources for additional research.

If, after steps 1 and 2, there is no clear sign of what people or organizations are behind the ad, that can indicate that the ad is pushing unreliable or false information.

3. Do a general search on organization or PAC name. Note any distinct names and/or addresses that come up.

4. Do a search at the FEC web site for the PAC name. Note addresses, and names of key officials in the PAC.

5. Do a focused search on the exact addresses on the FEC web site. Be sure to include suite numbers.

6. Do a focused search on any key names on the FEC web site.

The point of these searches is to find repetition: shared staff across orgs, and a common address across orgs, can suggest coordination.

While follow up research on organizations sharing a common address, or staff shared across multiple orgs, would be needed to help clarify the significance of any overlaps, this triage can help uncover signs of potential coordination between orgs that don't disclose their relationships.

Searching Notes

a. When doing a general search on the web for a PAC name, start with Duck Duck Go and Startpage. Your initial search should put the organization name in quotes. If Duck Duck Go and Startpage don't get you results, then switch to Google. However, because most organizations do white hat or black hat SEO with Google in mind, using other search engines for research can often get better results.

b. When searching the FEC web site, you can often get good results without using the search interface of the FEC web site. To do this, use this structure for your search query:

  • "your precise search string" site:docquery.fec.gov or
  • "your precise search string" site:fec.gov.
  • when searching for an address, split the address and the suite number: "123 Main Street" AND 111 site:fec.gov. Using this syntax will return results at "123 Main Street, Suite 111" or "123 Main Street STE 111" or "123 Main Street # 111". 

I generally use docquery.fec.gov first, as that brings up results that are directly from the filings, but either will work.

Unlike searches across the open web, Google will often return cleaner results than searching within the FEC site.
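These query patterns are easy to generate programmatically when checking a batch of addresses. A small sketch, reusing the placeholder address from above:

    # Sketch: build the FEC search-engine queries described above,
    # splitting the street address and suite number as recommended.
    def fec_queries(street, suite):
        return [
            f'"{street}" AND {suite} site:docquery.fec.gov',
            f'"{street}" AND {suite} site:fec.gov',
        ]

    for query in fec_queries("123 Main Street", "111"):
        print(query)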

A note on names of companies and individuals

In this writeup, we will be discussing companies, political groups, politicians, and consultants. Generally, companies, political groups, and politicians will be named in the text of this writeup.

I have reviewed screenshots and obscured names and email addresses that contained names, and in general individuals will not be named in this writeup. However, in some cases, the names of individuals will be accessible and visible via URLs shared in this document. This is a decision that I struggled with, and am still not 100% okay with, but it's hard to both show the process while not showing any potentially identifiable information.

I am not comfortable naming people, even when their names are readily available in the public record via multiple sources (and to be clear, all of the information described here is from publicly available documents). The fact that a person's name can be found via public records doesn't justify pulling a name from a public record and blasting it out. In this specific case, in this specific writeup, I made an intentional effort to not include the names in screenshots or in text. This provides some incremental protection (the names won't be visible in this piece via search, for example), while still providing clear and comprehensible instructions so that anyone can do similar research on their own.

But, for anyone doing this research on their own: do not be irresponsible with people's names and identities. Naming people can put a target on them, and that is just not right.

And, if anyone who reads my piece uses my work to target a person, you are engaging in reprehensible behavior. Stop. Conducting real research means that you will see real information. If you lack the moral and ethical character to use what you learn responsibly, you have no business being here.

2. Using the steps to analyze an ad

To show how to use this process, I will use the attack ad leveled at Representative Ocasio-Cortez that aired during a recent debate among Democratic presidential candidates. Representative Ocasio-Cortez responded to the ad on Twitter:

AOC on attack ad

To start, OpenSecrets has a breakdown of the major funders of the PAC behind the ad. The writeup here doesn't look at the funders; it goes into more detail about the PAC itself, and ways of researching it. If you are looking for information about specific funders, the post at OpenSecrets is for you.

Step 1: Who is running the ad

In this instance, the group behind the ad is pretty simple to find. A person connected to the group quote tweeted Representative Ocasio-Cortez:

Response to AOC

This leads us to the web site for New Faces GOP PAC at newfacespac.com.

Step 2: Visit the web site

The site features information about Elizabeth Heng, who lost the 2018 Congressional race for California District 16. It also features the ad that attacks Representative Ocasio-Cortez.

The site also includes a donation form, and at the bottom of the form we can see a small piece of text: "Secured by Anedot." This text gives us a bookmark.

Anedot embedded form

Many forms on political sites are embedded from third party services, and if we look at the original form we can often get useful information. To find the location of this form, we hit "ctrl-U" to view the page source, and then search the text (using ctrl-F) for "anedot".

This identifies the URL of the form.

Anedot URL

Strip away "?embed=true" from the end of the link, and you can go directly to the original form. In this case, the form gets us an address:

Anedot address

We'll note that for use later.
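For pages heavy with markup, this view-source step can also be scripted. Below is a minimal sketch, assuming the form URL appears in plain text in the page source; the page URL is a placeholder, not the actual PAC site.

```python
# A sketch of the view-source step: fetch a page, find an embedded Anedot form
# in the source, and strip "?embed=true" to reach the original form.

import re
import requests

page = requests.get("https://example-pac-site.com")  # placeholder URL
match = re.search(r"""https://[^\s"']*anedot[^\s"']*""", page.text, re.IGNORECASE)

if match:
    form_url = match.group(0).split("?embed=true")[0]
    print("Original form:", form_url)
else:
    print("No embedded Anedot form found.")
```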

Step 3: Search for the PAC name

A search for "New Faces GOP" turns up a listing in Bizapedia.

New Faces GOP search

This listing provides three additional names, and two additional addresses: a physical address in Texas, and a PO Box in California.

Bizapedia main page

The Texas address (700 Lavaca St Ste 1401, Austin, TX 78701) is a commonly used listing address for multiple organizations, which is a sign of a registry service. 

Lavaca Bizapedia

The California address (PO Box 5434 Fresno, CA 93755) appears to be used less widely.

Step 4: Do a search at the FEC web site for the PAC name

A search of the FEC web site returns the main page at https://www.fec.gov/data/committee/C00700252/?tab=summary -- this page provides an overview of fundraising and spending.

Additional details are available from the "Filings" tab at https://www.fec.gov/data/committee/C00700252/?tab=filings

FEC docs list

The Statements of Organization provide an overview of key people in the organization, and of relevant addresses.

The most recent Statement of Organization (shown below) contains the same Fresno PO Box (PO Box 5434 Fresno, CA 93755) found in the Bizapedia listing. The filings also include the name of a treasurer. We will note this name for focused searches later.

FEC filing for New Faces GOP

At the end of Step 4, we have the following information:

  • multiple addresses to investigate;
  • multiple people connected to the PAC; and
  • by virtue of having information pulled directly from FEC filings, some confirmation that our information is accurate.

Step 5: Do a focused search on the exact addresses on the FEC web site

For this search, we have three main addresses: the Fresno PO Box; the Austin, TX address; and the Washington DC address.

The Fresno PO Box links primarily to filings for New Faces GOP PAC, and for Elizabeth Heng's failed congressional bid.

FEC search PO Box

The search for the Texas address returns no direct results.

The search for the Washington DC address returns results for multiple different PACs, all connected to that address.

FEC search on DC address

The FEC results also include the name of a consulting firm, "9Seven Consulting."

In the spending cycle for 2018, this firm received $156,000 in disclosed payments, per OpenSecrets.

Oddly, a web search for "9Seven Consulting" returns a top hit for a digital consulting firm named "Campaign Solutions," which also appears to be the employer of the person listed across multiple PACs connected to the Washington DC address at 499 South Capitol Street SW, Suite 405. These results are consistent across DuckDuckGo and Google.

9Seven search results

A search on that address returns yet another advocacy group.

Prime Advocacy search results

This group claims to specialize in setting up meetings with lawmakers.

By the end of Step 5, we have collected and/or confirmed the following information:

  • many PACs list the Washington, DC address as their place of business;
  • at least two political consulting firms list the same Washington, DC address as their place of business; and
  • multiple PACs list a key employee who is also part of a digital consulting firm.

Step 6: Do a focused search on any key names on the FEC web site

For this search, we will focus on the name that appears across multiple filings. A Google search returns 135 results. Based on a quick scan of names, these PACs appear to be almost exclusively right-leaning. The results contain some repetition, but there are upwards of 25 unique PACs here. In the screenshot below, the same name appeared on all results; it is obscured for privacy reasons.

Search results with name obscured

Additionally, the same name appears on an IRS filing connected to George Papadopoulos. This filing also uses the same DC address.

Shared name

Based on the results of this search, it appears pretty clear that these PACs were supported by a common process or a common entity. The combination of shared staff on their filings and, in some cases, a shared address, could imply a degree of coordination. Clearly, the DC address is used as at least a mailing address for multiple organizations that have at least some general overlap in political goals.

What Does All This Mean?

The information uncovered via this process helps us understand what this ad is, what this ad isn't, and how political content gets generated.

Clearly, the group behind the ad is connected to Republican and right-wing political organizing. It is unclear whether or not the shared infrastructure and shared process used to create these PACs indicates any level of cooperation across PACs, or whether the PAC-generating infrastructure is akin to HR outsourcing companies that manage payroll and benefits for smaller companies -- but given the overlaps described in this post, a degree of coordination would certainly be possible and straightforward to set up, if it doesn't already exist.

The infrastructure supporting the New Faces GOP PAC seems solid. Based on their FEC filings, the group was formed in March of 2019, and by the end of June had raised over $170,000. While this isn't a huge amount of money by the standards of national political campaigns, it's still significant, and this level of access to donors, paired with access to the organizational expertise to manage the PAC, suggests a level of support that would be abnormal for a true grassroots effort.

However, this research just scratches the surface; on the basis of what we've seen here, there are multiple other PACs, people, and addresses that could expand the loose network we are beginning to see here. Political funding and PACs are a rabbit hole, and this research has us at the very beginning, leaning over and peering into the abyss.

But, understanding the ad in this context helps us see that it is one facet of what is likely a larger strategy that uses leaders like Representative Alexandria Ocasio-Cortez as foils to energize the Republican base. The hyperbolic rhetoric used in the ad normalizes overblown claims and irrational appeals in an effort to drown out conversations about policy. PACs can be used to fund a body of content that can help fuel future conversational spikes as needed, and to introduce narratives. Because PACs are so simple to form -- especially when there are consultancies that appear to bundle PAC creation with a digital distribution plan -- PACs can be thought of as a form of specialized link farm (https://en.wikipedia.org/wiki/Link_farm). Just like link farms, PACs provide a way of spamming the conversation with messages from orgs that can be discarded, and subsequently reborn under a different name.

The message matters, but the message in this case becomes clearer when filtered through the ecosystem of PACs that helped create it.

One final note

The research that fueled this writeup isn't especially time consuming. It took me about 10 minutes of searching. The writeup took a while -- they always do -- but the process of doing a quick triage is very accessible. More importantly, every time you do it, you get better and faster. Also, it's not necessary to review every ad. Just do some -- learn the process. By learning this research process, you can both see the forces that help shape (some highly misleading) political advertisements and get a clearer view into the process that allows money to shape politics. We are better able to disrupt and debunk what we understand.

Four Things I Would Recommend for Mycroft.ai

2 min read

Mycroft.ai is an open source voice assistant. It provides functionality that compares with Alexa, Google Home, Siri, etc., but with the potential to avoid the privacy issues of these proprietary systems. Because of the privacy advantages of using an open source system, Mycroft has an opportunity to distinguish itself in ways that would be meaningful, especially within educational settings.

If I were part of the team behind Mycroft.ai, these are four things I would recommend doing as soon as possible (and possibly this work is already in progress -- as I said, I'm not part of the team).

  1. Write a blog post (and/or update documentation) that describes exactly how data are used, paying particular attention to what stays on the device and what data (if any) need to be moved off the device.
  2. Develop curriculum for using Mycroft.ai in K12 STEAM classes, especially focusing on the Raspberry Pi and Linux versions.
  3. Build skills that focus on two areas: learning the technical details required to build skills for Mycroft devices; and a series of equity and social justice resources, perhaps in partnership with Facing History and/or Teaching Tolerance. As an added benefit, the process of building these skills could form the basis of the curriculum for point 2, above.
  4. Get foundation or grant funding to supply schools doing Mycroft development with Mycroft-compatible devices.

Voice-activated software can be done well without creating unnecessary privacy risks. Large tech companies have a spotty track record -- at best -- of creating consistent, transparent rules about how they protect and respect the privacy of the people using their systems. Many people -- even technologists -- aren't aware of the alternatives. That's both a risk and an opportunity for open source and open hardware initiatives like Mycroft.ai.

Fordham CLIP Study on the Marketplace for Student Data: Thoughts and Reactions

9 min read

A new study was released today from the Fordham Center on Law and Information Policy (Fordham CLIP) on the marketplace for student data. It's a compelling read, and the opening sentence of the abstract provides a clear description of what is to follow:

Student lists are commercially available for purchase on the basis of ethnicity, affluence, religion, lifestyle, awkwardness, and even a perceived or predicted need for family planning services.

The study includes four recommendations that help frame the conversation. I'm including them here as points of reference.

  1. The commercial marketplace for student information should not be a subterranean market. Parents, students, and the general public should be able to reasonably know (i) the identities of student data brokers, (ii) what lists and selects they are selling, and (iii) where the data for student lists and selects derives. A model like the Fair Credit Reporting Act (FCRA) should apply to compilation, sale, and use of student data once outside of schools and FERPA protections. If data brokers are selling information on students based on stereotypes, this should be transparent and subject to parental and public scrutiny.
  2. Brokers of student data should be required to follow reasonable procedures to assure maximum possible accuracy of student data. Parents and emancipated students should be able to gain access to their student data and correct inaccuracies. Student data brokers should be obligated to notify purchasers and other downstream users when previously transferred data is proven inaccurate and these data recipients should be required to correct the inaccuracy.
  3. Parents and emancipated students should be able to opt out of uses of student data for commercial purposes unrelated to education or military recruitment.
  4. When surveys are administered to students through schools, data practices should be transparent, students and families should be informed as to any commercial purposes of surveys before they are administered, and there should be compliance with other obligations under the Protection of Pupil Rights Amendment (PPRA).

The study uses a conservative methodology to identify vendors selling student data, so in practical terms, it is almost certainly under-counting the number of vendors. One of the vendors identified in the survey clearly states that they have information on students between the ages of 2 and 13:

Our detailed and exhaustive set of student e-mail database has names of students between the ages of 2 and 13.

I am including a screenshot of the page to account for any changes that happen to this page in the future.

Students between 2 and 13

The study details multiple ways that data brokers actively (and in some cases, enthusiastically) exploit youth. One vendor had no qualms about selling a list of 14- and 15-year-old girls for targeting around family planning services. The following quotation is from a sales representative responding to an inquiry from a researcher:

I know that your target audience was fourteen and fifteen year old girls for family planning services. I can definitely do the list you’re looking for -- I just have a couple more questions.

The study also highlights that, even for a motivated and informed research team, uncovering details about where data is collected from is often not possible. Companies have no legal obligation to disclose this information, and therefore, they don't. The observations of the research team dovetail with my firsthand experience researching similar issues. Unless there is a clear and undeniable legal reason for a company to disclose a specific piece of information, many companies will stonewall, obfuscate, or outright refuse to be transparent.

The study also emphasizes two of the elephants in the room regarding the privacy of students and youth: both FERPA and COPPA have enormous loopholes, and it's possible to be fully compliant with both laws and still do terrible things that erode privacy. The study covers some high level details, and as I've described in the past, FERPA directory information is valuable information.

The study also highlights the role of state-level laws like SOPIPA. SOPIPA-style laws have been passed in multiple states, starting in California. This might actually feel like progress. However, when one stops and realizes that there have been a grand total of zero sanctions under SOPIPA, it's hard to escape the sense that some regulations are more privacy theater than privacy protection. While a strict count of sanctions under SOPIPA is a blunt measure of effectiveness, the lack of regulatory activity since the law's passage either indicates that all the problems identified in SOPIPA have been fixed (hah!) or that the impact of the regulation is nonexistent. If a law passes and it's not enforced, what is the impact?

The report also notes that the data collected, shared, and/or sold goes far beyond simple contact information. The report details that one vendor collects information on a range of physical and mental health issues, family history regarding domestic abuse, and immigration status.

One bright spot in the report is that, among the small number of school districts that responded to the researchers' requests for information, none appeared to be selling or sharing student information with advertisers. However, even this bright spot is undermined by the small number of districts surveyed, the fact that some districts took over a year to respond, and the fact that at least one district did not respond at all.

The report details the different ways that school-age youth are profiled by data brokers, with their information sold to support targeted advertising. While the report doesn't emphasize this, we need to understand profiling and advertising as separate but related issues. A targeted ad is an indication that profiling is occurring; profiling is an indication that data collection from or about students is occurring -- but we need to address the specific problems of each of these elements distinctly. Advertising, profiling (including combining data from multiple sources), and data collection without clearly obtained informed consent are each distinct problems that should be understood both individually and collectively.

If you work with youth (or, frankly, if you care about the future and want to add a layer of depth to how you understand information literacy), the report should be read multiple times, and shared and discussed with your colleagues. I strongly encourage this as required reading in teacher training programs, and as back-to-school reading for educators in the fall of 2018.

But, taking a step back, the implications of this report shine a light on serious holes in how we understand "student" data. The report also demonstrates how the current requirement that a person be able to show a demonstrable harm from misuse of personal information is a sham. Moving forward, we need to refine and revise how we discuss misuse of information.

Many of the problems and abuses arise from systemic and entrenched lack of transparency. As demonstrated in the report:

It is difficult for parents and students to obtain specificity on data sources with an email, a phone call, or an internet search. From the perspective of parents and students, there is no data trail. Likewise, parents and students are generally unable to know how and why certain student lists were compiled or the basis for designating a student as associated with a particular attribute. Despite all of this, student lists are commercially available for purchase on the basis of ethnicity, affluence, religion, lifestyle, awkwardness, and even a perceived or predicted need for family planning services.

This is what information asymmetry looks like, and it mirrors multiple other power imbalances that stack the deck against those with less power. As documented in multiple places in the survey, a team of skilled researchers with legal, educational, and technical expertise was not able to pierce the veil of opacity maintained by data brokers and advertisers. It is both unrealistic and unethical to expect a person to be able to demonstrate harm from the use of specific data elements when the companies in a position to do the harm have no requirement to explain anything about their practices, including what data they used and how they obtained it.

But taking an additional step back, the report calls into question what we consider "student" data. The marketplace for data on school-age people looks a lot like the market for people who are past the traditional school age: a complete lack of transparency about how the data are gathered, sold, used, and retained. It feels worse with youth because adults are somehow supposed to know better, but this is a fallacy. When we turn 18, or 21, or 35, or 50, we aren't magically given a guidebook about how data brokers and profiling work. The information asymmetry documented in the Fordham report is the same for adults as it is for youth. Both adults and youth face comparable problems, but the injustices of the current systems are more obvious when kids are the target.

Companies collect data about people, and some of the people happen to be students. Possibly, some of these data might have been collected within an educational context. But, even if the edtech industry had airtight privacy and security, multiple other sources for data about youth exist. Between video games, health-related data breaches (which often contain data about youth and families in the breached records), Disney and comparable companies, Equifax, Experian, Transunion, Acxiom, Musical.ly, Snapchat, Instagram, Facebook Messenger, parental oversharing on social media, and publicly available data sources, there is no shortage of readily available data about youth, their families, and their demographics. When we pair that with technology companies (both inside and outside edtech) going out of business and liquidating their data as part of the bankruptcy process, the ability to get information about youth and their families is clearly not an issue.

It's more accurate to talk about data that have been collected about people who are school age. To be very clear, data collected in a learning environment are incredibly sensitive, and deserve strong protections. But drawing a line between "educational" data and everything else misses the point. Non-educational data can be used to do the same types of redlining as educational data. If we claim to care about student privacy, then we need to do a better job with privacy in general.

This is what is at stake when we talk about the need to stop our ISPs from selling our web browsing history, and our cellular providers from selling our usage information -- including precise information, in real time, about our location. What we consider student data is tied up in the data trails of their parents, friends, relatives, colleagues -- information about a younger sister is tied to that of her older siblings. Privacy isn't an individual trait. We are all in this together.

Read the study. Share the study. It's important work that helps quantify and clarify issues related to data privacy for adults and youth.

Privacy Postcard: Starbucks Mobile App

2 min read

For more information about Privacy Postcards, read this post.

General Information

App permissions

The Starbucks app has permissions to read your contacts, and to get network location and location from GPS.

Starbucks app permissions

Access contacts

The application permissions indicate that the app can access contacts, and this is reinforced in the privacy policy.

Starbucks contacts policy

Law enforcement

Starbucks terms specify that they will share data if sharing the information is required by law, or if sharing information helps protect Starbucks' rights.

Starbucks law enforcement

Location information and Device IDs

Starbucks can use location as part of a broader user profile.

Starbucks collects location info

Data Combined from External Sources

The terms specify that Starbucks can collect, store, and use information about you from multiple sources, including other companies.

Starbucks data collection

Third Party Collection

The terms state that Starbucks can allow third parties to collect device and location information.

Third party

Social Sharing or Login

The terms state that Starbucks facilitates tracking across multiple services.

Social sharing

Summary of Risk

The Starbucks mobile app has several problematic areas. Individually, each would be grounds for concern. Collectively, they show a clear lack of regard for the privacy of people who use the Starbucks app. The fact that the service harvests contacts, harvests location information, and allows selected information to be used by third parties to profile people creates significant privacy risk.

People shouldn't have to sell out their contact list and share their physical location to get a cup of coffee. I love coffee as much as the next person, but avoid the app (and maybe go to a local coffee shop), pay cash, and tip the barista well.

Privacy Postcards, or Poison Pill Privacy

10 min read

NOTE: While this is obvious to most people, I am restating this here for additional emphasis: this is my personal blog, and only represents my personal opinions. In this space, I am only writing for myself. END NOTE.

I am going to begin this post with a shocking, outrageous, hyperbolic statement: privacy policies are difficult to read.

Shocking. I know. Take a moment to pull yourself up from the fainting couch. Even Facebook doesn't read all the necessary terms. Policies are dense, difficult to parse, and in many cases appear to be overwhelming by design.

When evaluating a piece of technology, "regular" people want an answer to one simple question: how will this app or service impact my privacy?

It's a reasonable question, and this process is designed to make it easier to get an answer to that question. When we evaluate the potential privacy risks of a service, good practice can often be undone by a single bad practice, so the art of assessing risk is often the art of searching for the poison pill.

To highlight that this process is focused on surfacing risks rather than being comprehensive, I'm calling it Privacy Postcards, or Poison Pill Privacy -- it is not designed to be comprehensive, at all. Instead, it is designed to highlight potential problem areas that impact privacy. It's also designed to be straightforward enough that anyone can do this. Various privacy concerns are broken down below, and each includes keywords that can be used to find relevant text in the policies.

To see an example of what this looks like in action, check out this example. The rest of this post explains the rationale behind the process.

If anyone reading this works in K12 education and you want to use this with students as part of media literacy, please let me know. I'd love to support this process, or just hear how it went and how the process could be improved.

1. The Process

Application/Service

Collect some general information about the service under evaluation.

  • Name of Service:
  • Android App:
  • Privacy Policy url:
  • Policy Effective Date:

App permissions

Pull a screenshot of selected app permissions from the Google Play store. The iOS store from Apple does not support the transparency that is implemented in the Google Play store. If the service being evaluated does not have a mobile app, or only has an iOS version, skip this step.

The listing of app permissions is useful because it highlights some of the information that the service collects. The listing of app permissions is not a complete list of what the service collects, nor does it provide insight into how the information is used, shared, or sold. However, the breakdown of app permissions is a good tool to use to get a snapshot of how well or poorly the service limits data collection to just what is needed to deliver the service.

Access contacts

Accessing contacts from a phone or address book is one way that we can compromise our own privacy, and the privacy of our friends, family, and colleagues. This can be especially true for people who work in jobs where they have access to sensitive or privileged information. For example, if a therapist had contact information of patients stored in their phone and that information was harvested by an app, that could potentially compromise the privacy of the therapist's clients.

When looking at whether or how contacts are accessed, it's useful to cross-reference what the app permissions tell us against what the privacy policy tells us. For example, if the app permissions state that the app can access contacts and the privacy policy says nothing about how contacts are protected, that's a sign that the privacy policy could have areas that are incomplete and/or inadequate.

Keywords: contact, friend, list, access
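The keyword lists in this and the following sections lend themselves to a quick scripted scan. Here is a minimal sketch, assuming the policy text has been saved locally as policy.txt; the file name, category labels, and function are my own, and the KEYWORDS dictionary can be extended with the lists from the sections below.

```python
# A sketch of the keyword scan used throughout this process: print every
# policy line that contains one of a category's keywords, with line numbers
# for easy cross-referencing against the policy itself.

KEYWORDS = {
    "contacts": ["contact", "friend", "list", "access"],
}

def scan_policy(path, keywords):
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    for category, words in keywords.items():
        print(f"== {category} ==")
        for number, line in enumerate(lines, start=1):
            if any(word in line.lower() for word in words):
                print(f"  line {number}: {line.strip()}")

scan_policy("policy.txt", KEYWORDS)
```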

Law enforcement

Virtually every service in the US needs to comply with law enforcement requests, should they come in. However, the language that a service uses about how they comply with law enforcement requests can tell us a lot about the service's posture around protecting user privacy.

Additionally, if a service has no language in their terms about how they respond to law enforcement or other legal requests, that can be an indicator that the terms have other areas that are incomplete and/or inadequate.

Keywords: legal, law enforcement, comply

Location information and Device IDs

As individual data elements, both a physical location and a device ID are sensitive pieces of information. It's also worth noting that there are multiple ways to get location information, and different ways of identifying an individual device. The easiest way to get precise location information is via the GPS functionality in mobile devices. However, IP addresses can also be mapped to specific locations, and a string of IP addresses (ie, what someone would get if they connected to a wireless network at their house, a local coffee shop, and a library) can give a sense of someone's movement over time.

Device IDs are unique identifiers, and every phone or tablet has multiple IDs that are unique to the device. Additionally, browser fingerprinting can be used on its own or alongside other IDs to precisely identify an individual.
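To make the fingerprinting idea concrete, here is a minimal sketch of how a handful of device attributes can be hashed into a single stable identifier. The attribute values are invented for illustration; real fingerprinting draws on many more signals (fonts, canvas rendering, installed plugins).

```python
# A sketch of the idea behind browser fingerprinting: hash a handful of device
# attributes into one stable identifier. All values below are invented.

import hashlib

attributes = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) ...",
    "screen": "1920x1080x24",
    "timezone": "America/Los_Angeles",
    "language": "en-US",
}

# The same attributes always hash to the same ID, so a device can be
# recognized on a return visit without setting a cookie.
fingerprint = hashlib.sha256(
    "|".join(f"{k}={v}" for k, v in sorted(attributes.items())).encode()
).hexdigest()

print(fingerprint[:16])
```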

The combination of a device ID and location provides the grail for data brokers and other trackers, such as advertisers: the ability to tie online and offline behavior to a specific identity. Once a data broker knows that a person with a specific device goes to a set of specific locations, they can use that information to refine what they know about a person. In this way, data collectors build and maintain profiles over time.

Keywords: location, zip, postal, identifier, browser, device, ID, street, address

Data Combined from External Sources

As noted above, if a data broker can use a device ID and location information to tie a person to a location, they can then combine information from external sources to create a more thorough profile about a person, and that person's colleagues, friends, and families.

We can see examples of data recombination in how Experian sorts humans into classes: data recombination helps them identify and distinguish their "Picture Perfect Families" from the "Stock cars and State Parks" and the "Urban Survivors" and the "Small Towns Shallow Pockets".

And yes, the company combining this data and making these classifications is the same company that sold data to an identity thief and was responsible for a breach affecting 15 million people. Data recombination matters, and device identifiers within data sets allow companies to connect disparate data sources into a larger, more coherent profile.
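Here is a minimal sketch of the mechanics of recombination, with invented records: two data sets that are individually thin become a profile once they share a device ID.

```python
# A sketch of data recombination: joining two data sets on a shared device ID
# produces a combined profile that neither source could support on its own.
# Every record here is invented for illustration.

location_data = {
    "device-123": {"locations": ["coffee shop", "clinic", "school"]},
}
purchase_data = {
    "device-123": {"purchases": ["prenatal vitamins", "textbooks"]},
}

# Merge the records for every device ID seen in either data set.
profiles = {
    device_id: {
        **location_data.get(device_id, {}),
        **purchase_data.get(device_id, {}),
    }
    for device_id in location_data.keys() | purchase_data.keys()
}

print(profiles["device-123"])
# {'locations': [...], 'purchases': [...]}
```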

Keywords: combine, enhance, augment, source

Third Party Collection

If a service allows third parties to collect data from users of the service, that creates an opportunity for each of these third parties to get information about people in the ways that we have described above. Third parties can access a range of information (such as device IDs, browser fingerprints, and browsing histories) about users on a service, and frequently, there is no practical way for people using a service to know what third parties are collecting information, or how these third parties will use it.

Additionally, third parties can also combine data from multiple sources.

Keywords: third, third party, external, partner, affiliate

Social Sharing or Login

Social Sharing or Login, when viewed through a privacy lens, should be seen as a specialized form of third party data collection. With social login, however, information about a person can be exchanged between the two services, or taken from one service.

Social login and social sharing features (like the Facebook "like" button, a "Pin it" link, or a "Share on Twitter" link) can send tracking information back to the home sites, even if the share never happens. Solutions like this option from Heise highlight how this privacy issue can be addressed.

Keywords: login, external, social, share, sharing

Education-specific Language

This category only makes sense on services that are used in educational contexts. For services that are only used in a consumer context, this section might be superfluous.

As noted below, I'm including COPPA in the list of keywords here even though COPPA is a consumer law. Because COPPA (in the US) is focused on children under 13, there are times when COPPA connects with educational settings.

Keywords: parent, teacher, student, school, family, education, FERPA, child, COPPA

Other

Because this list of concerns is incomplete, and there are other problematic areas, we need a place to highlight these concerns if and when they come up. When I use this structure, I will use this section to highlight interesting elements within the terms that don't fit into the other sections.

If, however, there are elements in the other sections that are especially problematic, I probably won't spend the time on this section.

Summary of Risk

This section is used to summarize the types of privacy risks associated with the service. As with this entire process, the goal here is not to be comprehensive. Rather, this section highlights potential risk, and whether those risks are in line with what a service does. IE, if a service collects location information, how is that information both protected from unwarranted use by third parties and used to benefit the user?

2. Closing Notes

At the risk of repeating myself unnecessarily, this process is not intended to be comprehensive.

The only goal here is to streamline the process of identifying and describing poison pills buried in privacy policies. This method of evaluation is not thorough. It will not capture every detail. It will even miss problems. But, it will catch a lot of things as well. In a world where nothing is perfect, this process will hopefully prove useful.

The categories listed here all define different ways that data can be collected and used. One of the categories explicitly left out of the Privacy Postcard is data deletion. This is not an oversight; this is an intentional choice. Deletion is not well understood, and actual deletion is easier to do in theory than in practice. This is a longer conversation, but the main reason that I am leaving deletion out of the categories I include here is that data deletion generally doesn't touch any data collected by third party adtech allowed on a service. Because of this, assurances about data deletion can often create more confusion. The remedy to this, of course, is for a service to not use any third party adtech, and to have strict contractual requirements with any third party services (like analytics providers) that restrict data use. Many educational software providers already do this, and it would be great to see this adopted more broadly within the tech industry at large.

The ongoing voyage of MySpace data - sold to an adtech company in 2011, re-sold in 2016, and breached in 2016 - highlights that data that is collected and not deleted can have a long shelf life, completely outside the context in which it was originally collected.

For those who want to use this structure to create your own Privacy Postcards, I have created a skeleton structure on Github. Please, feel free to clone this, copy it, modify it, and make it your own.