Bill has worked in education as an English and history teacher, an administrator, and a technology director. Bill initially discovered the Internet in the mid-1990s at the insistence of a student who wouldn't stop talking about it.

twitter.com/funnymonkey

github.com/billfitzgerald

drupal.org/user/19631

Amazon and Whole Foods: Can I Have Some Data with that Kale?

4 min read

It looks like Amazon is buying Whole Foods.

Let's take a step back and look at the data involved here. We will start by looking at a person who only uses Amazon to shop online, buys food from Whole Foods, and reads using the Kindle app.

For those of us who have ever bought something, Amazon has our home address, and possibly related shipping addresses (i.e., if you have ever bought something as a gift and had it shipped directly to the recipient). Amazon potentially has one or more credit cards stored for us. Amazon has our purchasing history, and our browsing history. If we ever responded to an ad online for an Amazon product, Amazon has that referrer history, and can infer and expand their profile on us based on the sites that refer us to Amazon.

And, of course, Amazon collects information about all the different devices you use to access Amazon services - so Amazon has a precise record of all the hardware and software you use when you shop, potentially going back to when you first started shopping online. If you can't remember the phone you used in 2007, Amazon could probably tell you.

Moving on to Whole Foods, every time someone uses a credit card in the store, Whole Foods gets the person's name, their credit card number, their geographic location (the store), the time they were there, and the list of items they have purchased. Cross referencing this information with data collected by Amazon, the credit card number or name and zip code could be sufficient to connect these data sets with close to 100% certainty.
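To make the cross referencing concrete, here is a minimal sketch of how two data sets could be joined on a normalized name and zip code. Every record, field name, and value below is invented for illustration; this is not a description of how Amazon or Whole Foods actually store or match data.

```python
# Hypothetical records; field names and values are invented for illustration.
online_accounts = [
    {"account_id": "A-100", "name": "Pat Example", "zip": "97201"},
    {"account_id": "A-200", "name": "Sam Sample", "zip": "10001"},
]
store_purchases = [
    {"name": "PAT EXAMPLE ", "zip": "97201", "items": ["kale", "chicken"]},
    {"name": "Lee Other", "zip": "60601", "items": ["bread"]},
]


def link_key(record):
    """Normalize name and zip code into a simple join key."""
    return (record["name"].strip().lower(), record["zip"])


def link_records(accounts, purchases):
    """Attach in-store purchases to online accounts that share a key."""
    by_key = {link_key(account): account["account_id"] for account in accounts}
    linked = {}
    for purchase in purchases:
        account_id = by_key.get(link_key(purchase))
        if account_id is not None:
            linked.setdefault(account_id, []).extend(purchase["items"])
    return linked


linked = link_records(online_accounts, store_purchases)
print(linked)
```

A match on a full credit card number would be even simpler than this: an exact comparison, with no normalization needed.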

For people who use the Whole Foods App, the list of data collected by Whole Foods expands dramatically. The application collects geographic location, device information (i.e., the brand of phone or tablet, some form of device ID, the IP addresses it uses, etc), presumably an email address, and the ability to read and access wireless and Bluetooth connections. I'm not sure if Whole Foods does tracking via Bluetooth beacons, but the app permissions for the Android app leave that open as a possibility. If the Whole Foods app does ship with Bluetooth tracking enabled, anyone with the app installed and running can be tracked via Bluetooth beacons from just about anywhere. Potentially, if tracking were set up between any of Amazon's home devices (the Echo, etc) and the Whole Foods app that Amazon can now access, that would be a very effective way to map in-person social connections and online/offline activity.

If we shop online at Amazon, buy (expensive) food at Whole Foods, and read using the Kindle app, then we are also sharing our reading history, patterns, reading speed, and book buying history with Amazon. This data can also be used to infer interests (a person reads one type of book over another, and reads this type of book faster than another), habits (a person generally reads in the morning, and for a certain amount of time), and other personal patterns. When reading habits are cross-referenced against other personal habits (like the food we buy or the items we shop for) it creates a more complete profile of an individual.

It doesn't take much of a leap to see how a list of the food we buy, the items we shop for, the information we read, and where and when we do each of these actions would be of interest in things like health care. 

And, of course, Amazon has been moving into health care. And, given that we are seeing more experiments using things like sentiment analysis and wearable tech as a means to adjust insurance rates, scenarios that include shopping lists in insurance calculations aren't a stretch.

It's also worth noting that the depth of the Whole Foods data set will be a boon for companies like Amazon that look at differential pricing. Amazon will now be in a great position to identify people willing to pay more for everyday items.

So, have fun shopping at Whole Foods. That organic, free range, hormone free chicken you will be eating tonight will be pecking in your data trail for a while. 

Twitter's Misleading User Experience When Reporting Abuse

2 min read

Twitter's history of combating trolls and abuse has been problematic, at best.

Recently, I discovered a corner in their toolkit that highlights why Twitter's current efforts remain ineffective.

When reporting a person for abuse (or, more likely, a bot), Twitter leads you through a multi-step process. 

In the first step, we select an account or a tweet to report.

Step 1 

In the second step, we define the reason for the report.

Step 2

In the third step, we provide additional details.

Step 3

In the fourth step, we indicate who is being harassed.

Step 4

In the fifth step, we select up to five tweets that demonstrate the harassment.

Step 5

In the sixth step, we decide whether we want to block the account, mute the account, or do neither. When we click "Done", the offending tweets we reported are no longer visible. Voila. The process has worked.

Step 6

Except, it hasn't. Despite appearances, Twitter has done nothing to address the abuse. When you are logged in, you can't see the Tweets you reported. To the rest of the world - including, literally, everyone who isn't you - the content is still visible. This almost certainly includes search engines.

From your perspective, it actually looks like Twitter has done something, but from a practical perspective, Twitter has engaged in a game of smoke and mirrors. This happens regardless of whether or not we select "Block" or "Mute"; Twitter still hides the tweets you reported from you, and you alone.

This is dangerous. If a person has been doxxed on Twitter and they report the tweet, Twitter's UX creates the misleading impression that the offending content has been removed. The solution to this problem is simple: Twitter should let the "Block" or "Mute" options work as intended. While this wouldn't fix Twitter's abysmal record of responding to abuse, it would at least provide a more honest user experience.

When Twitter automatically hides offensive content from the people who have reported it, they create the impression that they have done something, when they have done nothing. Design choices like this demonstrate Twitter's apathy towards effectively addressing hate and abuse on their platform.

Edmodo Has Removed Tracking From Their Web Site For Students and Teachers

2 min read

Last night, I heard from representatives at Edmodo in response to my post on ad trackers. I need to emphasize at the outset that the speed of their response here is a very positive sign. I published my post around 9:00 AM on a Saturday, and I heard from them less than 12 hours later on Saturday night.

In their email to me, they shared that the code and tracking behavior I observed was left over from testing. While they investigate solutions, they are both removing this code, and turning off ads. This change is already in place. As of this writing, there no longer appears to be any tracking of teacher or student accounts. I have done a quick visual examination to verify this with my test accounts.

This is the right step to take. Edmodo deserves credit for taking this step, and taking it so quickly. I am hoping and optimistic that this is a permanent change.

UPDATE: May 14, 2017

I have heard additional details from the team at Edmodo about their technical implementation. Although my original post was not about their Beta Sponsored Content program, they wanted to be very clear that, for that program, they used Doubleclick's COPPA-compliant flag. The information they conveyed to me is included below:

"(f)or the ads we recently started serving to Edmodo users through Doubleclick, we turned on the COPPA-compliant tag. The COPPA-compliant tag is supposed to prevent behavioral tracking. We have turned off those ads until we can confirm that the COPPA-compliant tag is working properly to prevent behavioral tracking."

END UPDATE

Tracking of Teachers and Students in Edmodo

7 min read

UPDATE: Sunday, May 14th, 2017 - I heard from Edmodo last night, and they have removed the tracking that is observed and discussed in this post from their web application. Their response was fast, and they deserve a lot of credit for making this decision, and implementing it quickly. Details available here. END UPDATE

0. Introduction

This has been a rough week for Edmodo. Unlike many other people, I will not be writing about the data breach that leaked information about 77 million Edmodo users. Instead, in this post, I will look at ad tracking within Edmodo that affects both teachers and students.

Looking at Edmodo was not on my list of things to do this week. I did this research on my personal time, completely disconnected from my work. The reason I was looking at all was that I received a message from a person advising me about what to look for, and this message contained details that made the report credible. While I can't promise I will be able to research everything sent my way, I am always interested in working with students, parents, and teachers. If you see something that looks or feels odd, please be in touch.

1. Process

For this post, I set up a test Edmodo teacher account, and two sample student accounts. I observed traffic while logged in using OWASP ZAP. The test student account in this test was from a student in a fourth grade class, so the student would be under 13. All cookies, the browser cache, and browsing history were cleared prior to testing. The browser was re-cleared between all test sessions.

2. What We Aren't Looking At

This spring, Edmodo announced that they are allowing ads (Edmodo calls them "sponsored" or "promoted" content) to be displayed on their site. This post is not about Edmodo displaying ads on their site.

3. Displaying Ads versus Tracking

There is a big difference between displaying an ad and tracking users. When an ad is displayed, the actual ad can be understood as a visual indication of potential tracking.

However, users can be tracked without ads being immediately displayed. This type of tracking is largely invisible to end users, but this tracking sends a regular stream of data back to the data broker/ad network. This data includes, at minimum, the page a user is on, the precise time they are on it, the operating system and version, the IP address of the user, and the browser and version. All of this information is tied together via a common identifier. In many cases, the combination of technical factors about a user - device information and/or IP address - is adequate to identify, or come close to identifying, an individual. Because this information is all tied together with a common identifier, the probability of identifying an individual increases.

Because of this, we treat the display of ads as a separate issue from tracking users. Both can be problematic, and ads can be displayed with or without user tracking. In this post, I focus only on mechanisms used to track users.
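To illustrate why the common identifier matters, here is a minimal sketch that groups a stream of tracking events by ID. The event fields and every value are invented; real tracker payloads differ from vendor to vendor, and this is only meant to show how separate, invisible calls add up to a profile.

```python
from collections import defaultdict

# Hypothetical tracker log entries; all values are invented.
events = [
    {"id": "abc123", "page": "/home", "time": "2017-05-13T09:00:05",
     "ip": "203.0.113.7", "ua": "Firefox 53 / macOS 10.12"},
    {"id": "abc123", "page": "/assignments", "time": "2017-05-13T09:02:41",
     "ip": "203.0.113.7", "ua": "Firefox 53 / macOS 10.12"},
    {"id": "zzz999", "page": "/home", "time": "2017-05-13T09:03:10",
     "ip": "198.51.100.4", "ua": "Chrome 58 / Windows 10"},
]


def build_profiles(events):
    """Group events by the shared identifier to form per-user profiles."""
    profiles = defaultdict(lambda: {"pages": [], "ips": set(), "user_agents": set()})
    for event in events:
        profile = profiles[event["id"]]
        profile["pages"].append((event["time"], event["page"]))
        profile["ips"].add(event["ip"])
        profile["user_agents"].add(event["ua"])
    return dict(profiles)


profiles = build_profiles(events)
for tracking_id, profile in profiles.items():
    print(tracking_id, len(profile["pages"]), "page views", sorted(profile["ips"]))
```

No ad is displayed anywhere in this flow. The profile is built entirely from the background calls.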

4. Tracking Teachers

Teachers are targeted by a range of ad trackers, as shown below. The teacher login occurs in line 175; we can observe multiple trackers getting called after login.

Tracking teachers

This is pretty standard ad tracking behavior, and we are not going to spend additional time on this, as the student tracking is more complicated. However, for educators using Edmodo, this is how your usage information is passed to data brokers when you are logged into the site working with students.

5. Tracking Students

In Edmodo, students are exposed to targeted ad tracking as follows. I will open with a brief description, and then follow that with a more detailed description that includes screenshots from the proxy logs used to capture traffic.

5.1 Brief description

  • A. When a student logs in to Edmodo, Edmodo allows Google's Doubleclick to set a tracking cookie.
  • B. While a student is logged in, there are additional calls to Doubleclick. These calls include information about the student's computer, and the page that they are currently on.
  • C. When the student logs out of Edmodo, this triggers a call to Doubleclick.
  • D. In turn, this spawns two additional calls to ad networks. The ID value that is sent to Doubleclick is the same value that is set when the student logged in, and the referrer from Edmodo clearly identifies the user as a student.

5.2 Details

5.2.A. When a student logs in to Edmodo, Edmodo allows Google's Doubleclick to set a tracking cookie.

Setting a cookie on a student account at login

The login occurs in line 141. The call to Doubleclick occurs after login in line 160.

Setting a cookie value

In the above screenshot, Doubleclick sets a cookie in the student's browser with a unique ID. The test account in this writeup is a student in a fourth grade class, so the student would be well under 13. Edmodo allows teachers to specify the grade level of their courses, so arguably Edmodo would have actual knowledge in some cases if a student is under 13.

Choosing a grade level in Edmodo

5.2.B. While a student is logged in, there are additional calls to Doubleclick. These calls include information about the student's computer, and the page that they are currently on.

Additional calls to Doubleclick

Each of these individual calls contains information about the student's path through the platform, which is shared with Doubleclick and tied to the tracking ID created in Step A.

5.2.C. When the student logs out of Edmodo, this triggers a call to Doubleclick.

Student logout

The logout occurs in Line 554. The calls to Doubleclick occur in Lines 561, 564, 571, and 573. These calls are discussed in more detail below.

5.2.D. In turn, this spawns two additional calls to ad networks.

Calls to multiple networks

The ID value that is sent to Doubleclick is the same value that is set when the student logged in, and the referrer from Edmodo clearly identifies the user as a student (note the user_type=student at the end of the URL).

On the left hand side of the screenshot, you will notice a reference to "pubmatic" and "rubicon." These are two commonly used ad brokers: https://pubmatic.com and http://rubiconproject.com

Calls are made to these two ad brokers based on the redirect observed above.

rubicon

And:

Pubmatic
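The two checks described in 5.2.D (the ID sent at logout matches the ID set at login, and the referrer identifies the user as a student) can be sketched with Python's standard library. The URLs, the `id` parameter name, and the ID value below are placeholders rather than values from the actual logs, and I am assuming the ID shows up as a query parameter; in a real capture it may travel in a Cookie header instead.

```python
from urllib.parse import urlparse, parse_qs

# Placeholder values standing in for what a proxy log might show.
login_cookie_id = "EXAMPLE-ID-0001"
logout_request = {
    "url": "https://ads.example.net/sync?id=EXAMPLE-ID-0001",
    "referer": "https://school.example.com/home?user_type=student",
}


def query_value(url, key):
    """Return the first value of a query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get(key, [])
    return values[0] if values else None


same_id = query_value(logout_request["url"], "id") == login_cookie_id
user_type = query_value(logout_request["referer"], "user_type")

print("ID matches login cookie:", same_id)
print("user_type in referrer:", user_type)
```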

6. This Couldn't Happen Without Edmodo's Active Involvement

To see a little bit behind the mechanics here, we need to take a look at the source code on Edmodo's site. The screenshot below is taken from the page source, while logged in as a student user in a test fourth grade class.

Hardcoded Google IDs

Note the conversion ID that Edmodo has hardcoded into their web page. Then, we will take a look at the call that is made to Doubleclick after our test 4th grade student has logged in:

Google IDs sent over

The referrer here is the student's home page within Edmodo, and the call to Doubleclick includes the hardcoded value set by Edmodo.

7. Conclusions

As documented in this post, the presence of ad trackers for both teachers and students can be observed when we inspect traffic via an intercepting proxy. Some obvious questions that come to mind are:

  1. How aware are teachers in the Edmodo community that they are being tracked by ad brokers permitted on the site by Edmodo?
  2. How aware are students, teachers, and parents that ad brokers can collect data on students while using Edmodo?
  3. How does the presence of ad trackers that push information about student use to data brokers improve student learning?
  4. Are Edmodo Ambassadors briefed on the student-level tracking that occurs within Edmodo? If not, why not?

An additional (and likely) possibility here is that not everyone within Edmodo is aware that this tracking is occurring. Companies are not monoliths, and few decisions within companies have the support and/or awareness of everyone in the company.

It is also possible that the student level tracking is the result of a technical error that did not get caught by a QA/testing process.

There are additional questions that can and should be asked, but in the interest of keeping a narrow focus, I will leave things here.

Ad Tracking on Kaiser Permanente's Patient Health Portal

4 min read

Last night, I logged onto the Kaiser Permanente patient health portal. I hadn't done this in a while.

I use a javascript blocker in my web browser. After logging into the site, I was very surprised to see a call to Google Ad Manager.

Call to Google Ads

This sparked my curiosity, so I decided to run the entire session through an intercepting proxy.

The intercepting proxy showed that Kaiser Permanente permits multiple ad trackers to collect data about people seeking health information from the Kaiser Permanente patient portal. To be clear, I was logged in to the portal - I was not browsing anonymously. The observed trackers specifically target logged in users.

In my very brief test, I observed the following trackers: Google Ad Services, WebTrends, Demdex, Omniture, and Doubleclick (which is part of Google). The screenshot below shows a subset of these trackers, taken from the intercepting proxy. I have saved the proxy logs in case it's ever necessary to review or verify them.

Trackers, after login

Kaiser is very clear in their terms that, in their member health portal, they allow third party ad trackers to collect information about patients at Kaiser that use their health portal.

Their terms lack any details about any limits placed on how these third parties can use the data they collect from patients who have logged in to Kaiser's portal seeking health information. Specifically, the terms do not state that third parties who collect data from Kaiser's patient health portal are prohibited from enhancing or potentially re-identifying people within the data set. It's also worth noting that the "opt out" feature is completely ineffective.

However, even basic information could help advertisers target or exploit users. If a person logs onto the Kaiser site four times in a week, that tells a different story to ad trackers than a person that logs onto the site once a month.

Then, if that same person logs onto the Kaiser patient health portal and heads over to WebMD to look for additional information, data brokers can connect the same individual (via cookie values set on the Kaiser site) to both sites.

This ad tracking takes on an even more invasive and intrusive tone for parents who have linked a child's account, or for an adult who is managing health care for an aging parent or sick spouse or partner. Because Kaiser permits ad trackers on its health portal (or really, on our health portal), these intimate, highly personal moments are exposed to ad trackers and data brokers.

The opportunistic business models of data brokers are clearly documented. Packaging health information is good business for them. Data brokers know that people with health issues or concerns can be more vulnerable. As Frank Pasquale notes in this piece from 2014, data brokers create and sell multiple lists that target health-related issues:

They have created lists of victims of sexual assault, and lists of people with sexually transmitted diseases. Lists of people who have Alzheimer’s, dementia and AIDS. Lists of the impotent and the depressed.

Because of the language Kaiser has included in their terms, it is clear that Kaiser has made a very intentional decision: they are allowing patients looking for health information to be targeted by ad trackers. Kaiser should provide some additional clarity about this practice, and answering these questions would be a good start:

  • What third party trackers are allowed on the Kaiser Site to collect data about logged in Kaiser patients?
  • How long have these trackers been allowed on Kaiser's Health Portal?
  • For each tracker, what data are collected? How is this data used?
  • Why were these ad trackers chosen over other ad trackers?
  • How much revenue is generated for Kaiser via these ad trackers? What are the precise details of the business arrangement between the ad trackers and Kaiser Permanente?
  • How can a Kaiser patient who uses the portal review all of the data that Kaiser has allowed to be collected about them?
  • How does placing ad trackers that collect information about logged in users on the Kaiser Permanente web site improve patient outcomes?

I will be contacting Kaiser directly to share these concerns, and I will update this post and/or write follow up posts to share what I learn.

TRUSTe's Opt Out Is a Cynical Joke

1 min read

I've been meaning to write this out for a while.

TRUSTe's "opt-out" option is a cynical joke. The page is here: http://preferences-mgr.truste.com/

Here are a few reasons why this "solution" is worthless.

  • this opt-out "solution" doesn't stop data collection, it just stops the display of ads.
  • participation in this program isn't required; it's voluntary. Some vendors don't participate at all, while others participate but don't integrate with TRUSTe's platform.
    This doesn't work.

    From an end user perspective, this means that opting out via TRUSTe's "solution" requires visiting multiple sites, just to trigger an opt-out mechanism that doesn't actually stop data collection.
  • the "solution" is cookie based, so whenever a person resets their cookies, even these pathetically limited opt out options go away.

An actual solution is using a javascript blocker, and/or uBlock Origin and Privacy Badger.

When an industry-backed "solution" is this toothless, it creates the distinct impression that industry is phoning this in. Over the last few years, the FTC and the New York Attorney General have seen some problems here as well.

BuzzFeed and Methods for Tracking the Trackers; or This Is Hard, Chapter 9674

7 min read

For the last several months, Kris Shaffer and I have been working together on tracking news sites, partisan sites, and hate sites, and their relative popularity on social media. We have also been looking at the advertising and tracking technology used on these sites in an effort to understand how these sites generate revenue. Based on our research, with initial summaries published between February and late March, 2017, we concluded that ad tech and tracking allows misleading news and hate speech to generate revenue.

Kris has three posts on the subject:

I published this piece:

We have been continuing this work because, while our early research showed some significant and interesting patterns, these issues are complex, and we want to be thorough.

Fortunately, there are other people doing similar work. This BuzzFeed article published in early April looks at very similar details to what Kris and I have been researching, and reaches some similar conclusions. However, when reviewing the data behind the BuzzFeed work, I noticed some anomalies that appear to be related to the methodology used to collect the data supporting the BuzzFeed piece.

At the outset, I want to highlight that this conversation wouldn't be possible if all of us weren't describing our methods. While the methodology of the BuzzFeed piece omits some essential details, the overall conclusions still hold up. The need to counter misinformation and the business models that make misinformation profitable are universally recognized, and the more people we have looking at these details, the better.

The more we credit the range of work happening in this space, the better. One paper I hadn't seen until yesterday was this study from Mezzobit. I will definitely be reaching out to look at this service. I have also benefited from being able to talk with and learn from David Carroll, Chris Gilliard, Jeff Graham, and Girard Kelly, among others.

But, returning to the BuzzFeed story, this post will look at 3 main concerns: the methodology, the focus on display ads versus the larger ecosystem, and how BuzzFeed's adtech practices compare to the companies they study. I have additional questions on the use of archive.org as a tool to track adtech, but a detailed discussion of that topic is outside the scope of this post.

Methodology (Ghostery-based versus intercepting proxy)

Our methodology in studying trackers is pretty straightforward. We use OWASP ZAP (an intercepting proxy) to capture activity when we visit a site. Then, we export all URLs from the session, which is core functionality in the proxy. Then, we use tldextract to break these URLs down into their component pieces to make them easier to study. This gives us a precise (albeit labor intensive) view into what trackers are placed on what sites.
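As a rough sketch of the "export URLs, then break them down" step, the example below counts calls per domain using only Python's standard library. This is a simplified stand-in, not our actual script: `naive_registered_domain` is a hypothetical helper that just keeps the last two labels of the hostname, which gives wrong answers for suffixes like `co.uk`. Handling those cases correctly is the reason to use tldextract. The URLs are invented.

```python
from collections import Counter
from urllib.parse import urlparse


def naive_registered_domain(url):
    """Keep the last two hostname labels (a crude stand-in for tldextract)."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


# Invented URLs standing in for an export from the proxy.
exported_urls = [
    "https://www.example.com/article/1",
    "https://ads.example.net/pixel?id=1",
    "https://cdn.example.net/lib.js",
    "https://sync.example.org/match",
]

counts = Counter(naive_registered_domain(url) for url in exported_urls)
for domain, count in counts.most_common():
    print(domain, count)
```

Collapsing subdomains this way is what makes it possible to see that several different hostnames in a session belong to one vendor.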

There are multiple ways to get this view, each with their own advantages and drawbacks. The BuzzFeed methodology uses a web-based tool:

Liliana Bounegru, a co-investigator on the upcoming A Field Guide To Fake News, used the Tracker Tracker tool to extract ad trackers currently present on the homepage and one article page of each of these sites. Some sites on the list are no longer active, so those were discarded in the analysis. Bounegru then used the Wayback Machine to look for archived versions of the homepage and an article page for each of these sites prior to November 2016. In the end, we identified 51 sites that had trackers on their archived pages and were still online in March 2017.

The tracker tool used to drive the BuzzFeed article is a web-based tool that appears to be based on Ghostery. While its output is informative, it's not precise enough to be considered complete. It's still a useful tool because it's going to be imprecise in consistent ways, but the imprecision can lead to a lack of necessary detail.

As an example, the BuzzFeed article mentions multiple sites where they were unable to identify the source of some ads and their associated ad networks.

The networks serving ads on the pages were collected into a spreadsheet. In some cases, we were not able to identify the provider responsible for pop-unders that were present on several sites. We noted that in the spreadsheet.

Using an intercepting proxy, identifying the source of the pop-under ads is pretty straightforward. We ran a test on TMZWorldStarNews, one of the sites identified as having unidentified pop-under ads. The full archive of the BuzzFeed data set is available on Github.

In our review, the url of the pop-under was http://www.sike.tv/topvideos/?utm_source=advertisecom&utm_campaign=STV-01-RON&utm_term=72822-iy

When we look at the URL, it contains the string "?utm_source=advertisecom" - and advertise.com is a known ad network. When we take a deeper look into the proxy logs, we can see the full set of popunders that will be triggered by this provider, along with the affiliated urls used to deliver content. In tracking ads, affiliating domains with specific providers is both important and difficult to do. Using an intercepting proxy helps give a clearer view of the actual traffic, which helps make these connections.
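Pulling the campaign parameters out of the pop-under URL takes only a few lines with Python's standard library. This is a minimal sketch of that one step; mapping the `utm_source` value to a specific ad network still requires knowing that "advertisecom" refers to advertise.com.

```python
from urllib.parse import urlparse, parse_qs

popunder_url = (
    "http://www.sike.tv/topvideos/"
    "?utm_source=advertisecom&utm_campaign=STV-01-RON&utm_term=72822-iy"
)

parsed = urlparse(popunder_url)
# parse_qs returns a list per key; keep the first value of each.
params = {key: values[0] for key, values in parse_qs(parsed.query).items()}

print(parsed.hostname)
print(params["utm_source"])
print(params["utm_campaign"])
```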

Display ads versus trackers/advertising ecosystem

The BuzzFeed article appears to direct attention onto what ads get displayed, rather than the larger tracking ecosystem.

In order to determine the ads currently running on fake news sites, Silverman visited 76 active fake news sites without an ad blocker, and with the Ghostery browser plugin enabled. (Ghostery identifies which ad trackers are active on a given webpage, and is also used in the Tracker Tracker tool.) For each site he visited the homepage and at least one article page to examine the ads.

However, focusing solely on the display of ads omits the larger ecosystem of vendors that track users. Using the example of TMZWorldStarNews, the BuzzFeed dataset doesn't identify any trackers.

Using an intercepting proxy, we observe nearly 800 different calls to several hundred distinct urls while visiting the homepage and a single article on TMZWorldStarNews. Scores of these distinct URLs belong to ad trackers. Each of these ad trackers gets data on users, and many of these ad trackers appear affiliated because they pass cookie IDs to one another. These affiliations are visible via an intercepting proxy, although spotting them requires some detailed searches through the proxy logs.
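One way to speed up those searches is to look for the same long value showing up in requests to more than one host. The sketch below is a simplified illustration, not our actual tooling: it only checks query strings (IDs can also be passed in cookies, headers, or URL paths), the `min_length` cutoff is an arbitrary assumption, and the URLs are invented.

```python
from collections import defaultdict
from urllib.parse import urlparse, parse_qsl


def shared_ids(urls, min_length=8):
    """Map each long query value to the hosts that received it, keeping
    only values seen on more than one host."""
    seen = defaultdict(set)
    for url in urls:
        parsed = urlparse(url)
        for _, value in parse_qsl(parsed.query):
            if len(value) >= min_length:
                seen[value].add(parsed.hostname)
    return {value: hosts for value, hosts in seen.items() if len(hosts) > 1}


# Invented URLs standing in for an export from the proxy.
urls = [
    "https://a.example.com/sync?uid=9f8e7d6c5b4a",
    "https://b.example.net/match?partner_uid=9f8e7d6c5b4a",
    "https://c.example.org/pixel?page=home",
]

matches = shared_ids(urls)
for value, hosts in matches.items():
    print(value, sorted(hosts))
```

Any hit from a check like this is a lead, not proof. Each candidate still needs to be confirmed by reading the surrounding requests in the proxy log.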

What does BuzzFeed do?

Another interesting question that we encounter in our study of ad tracking centers on how more mainstream sites track their visitors and deliver ads. It's one thing to say that ad networks will indiscriminately sell to misinformation sites, but it's still another thing when mainstream sites continue to work with ad tech vendors who will sell to anyone. If we look at the web through the lens of ad tech, many web sites with very different content have significant overlaps via the ad tech they use.

From a quick glance at the ad tech used on BuzzFeed, we see some overlaps with what we observed on TMZWorldStarNews. Both sites make calls to the third party sites/ad trackers listed below:

  • agkn.com
  • crwdcntrl.net
  • demdex.net
  • doubleclick.com
  • facebook.com
  • moatads.com
  • nexac.com
  • quantserve.com
  • quantcount.com
  • scorecardresearch.com
  • twitter.com

BuzzFeed is not alone here. As we observed earlier, other mainstream sites use the same adtech as highly partisan or misinformation sites.
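Once each site's calls have been reduced to a set of domains, finding the overlap is a set intersection. In this sketch the shared domains are taken from the list above, while the `.example` entries are invented placeholders so that each set has something unique. These are not the full sets we observed on either site.

```python
# Shared domains come from the list above; ".example" entries are placeholders.
site_a_domains = {"demdex.net", "moatads.com", "quantserve.com", "only-on-a.example"}
site_b_domains = {"demdex.net", "moatads.com", "quantserve.com", "only-on-b.example"}

shared = sorted(site_a_domains & site_b_domains)
print(shared)
```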

How can we expect ad trackers to heed calls for increased responsibility when mainstream news organizations continue to give money and user data to companies that support misinformation?

Conclusion

Tracking ad trackers is far more complex than it should be - and getting the details right is essential in mapping the terrain. Ad tracking - and the profiling it requires - is central to making misinformation profitable. It also lays the foundation for increased information asymmetry, which is a key element in maintaining existing power structures. We need to make the entire ad tracking system easier to understand. It's difficult, complicated work - and that's another reason why people who care about getting this right need to work together.

ISPs Can Continue to Collect and Sell All of Our Browsing History, and We'll Never Know

4 min read

Yesterday, on March 28, 2017, Congressional Republicans gave a huge gift to Internet Service Providers (ISPs) by killing rules that would have prevented them from selling our browsing history. Because Congressional Republicans killed these rules, ISPs - companies like Comcast, Verizon, Qwest (aka CenturyLink), AT&T, etc - can continue to sell information about how we browse the web. All browsing we do on the web - from a young child looking for information about dinosaurs, to a teen curious about their sexual identity, to a person reading the news, to a parent looking for medical information, to a person browsing pornography - all of these activities, done inside people's homes, can continue to be tracked and sold to anyone, without our knowledge or consent.

We need to pause here - this is actually as bad as it sounds. If you have kids in your house, their browsing activity can be bundled and sold by your ISP. As their parent, you will never be told that any sale took place, who the buyers are, or how they are using that information. So, the next time your kids are having a playdate, if your kid's friends connect to your internet, your ISP is profiting from the playdate. Thanks to the actions of congressional Republicans, this is universal across the US. Every ISP in the US can continue to do this.

However, this isn't the worst of it. An element that has gone largely undiscussed is how this rule change puts ISPs in a commanding position when it comes to connecting online and offline behavior. Connecting online and offline identity is a leading concern with advertisers - and rest assured, they are looking at this through a racial lens as well. For people who connect to the internet via a phone and a computer, our ISP can now identify both devices as belonging to a specific home. This is incredibly valuable information - and because this information can be shared and sold indiscriminately, it allows for a solid connection to be made between an individual, their home address, their computer, and their phone.

In practical terms, this sets ISPs in a position to be able to track our physical location over time, and predict our location in real time. For all of us who carry smartphones, our phones connect to multiple ISPs over the course of every day - from different cell towers, to coffee shop wireless, to library wireless, to connectivity provided by our school or workplace. If our ISP shares our device information, we can be precisely identified across a range of locations, and a record of our movement can be stored and collected. Location data has been shown to be a strong predictor of identity, but our ISPs are in a position where location data is just a small part of their overall data set.

At this point, the only real protection is to use a VPN. However, many VPNs only protect a single device - to protect a home requires setting up a VPN on all devices, or configuring a router to connect to the internet via a VPN. While setting up a router to connect via a VPN is not enormously complicated, it's a significant technical barrier that will definitely be beyond the reach of many consumers.

It's also worth noting that VPNs will only be a realistic alternative if our ISPs don't throttle VPN connections, and reduce their speed to a crawl. Because of actions taken in the FCC under Tom Wheeler, we currently have some protections, but Republicans are also looking to kill net neutrality. This would be bad for a variety of reasons, but would also be another blow to personal privacy.

Adtech and Misinformation: the Middlemen Who Sell to All Sides

18 min read

When we look at adtech, much of the focus falls on one of two places: the advertisers who sell specific products via online advertising, or the data brokers who package and sell our information. And this is good - data brokers in particular pose a unique threat, and need much more attention. However, online ads get delivered via a network of middlemen that automate and streamline the process. These middlemen are effective - for example, when we read about youth in Macedonia making thousands of dollars a month from political misinformation aimed at US audiences, we need to remember that the profits generated by these sites wouldn't happen without the use of adtech. (These middlemen are generally described by those in the advertising industry in jargon-heavy prose. For people looking for a high-level background, this post discusses Supply Side Platforms; this post discusses Demand Side Platforms, and this video describes how ads are bought, targeted, and delivered.)

In a speech from January 2017, Randall Rothenberg - the President and CEO of the Interactive Advertising Bureau, or IAB - directly acknowledged the effectiveness of these middlemen, and the role that advertising plays in making misinformation profitable:

As an industry, it is our obligation to again step up. But this time, our goal cannot be merely to fix our supply chain. Our objective isn’t to preserve marketing and advertising. When all information becomes suspect – when it’s not just an ad impression that may be fraudulent, but the data, news, and science that undergird society itself – then we must take civic responsibility for our effect on the world.

Who Shares What

A few weeks back, Kris Shaffer and I began talking about getting a clearer understanding about what people read and share on Twitter, and what that looks like across the political spectrum. We started looking at the stories shared - and the sites they are shared from - to get a sense of patterns. Based on patterns observed in the data Kris collected, I created a list of 25 sites from across the political spectrum, ranging from misinformation targeted to progressives, to left leaning, to mainstream media, to right leaning, to hate sites.

  • Addicting Info
  • AllenBWest
  • Alternet
  • Bipartisan Report
  • Breitbart
  • Daily Caller
  • Daily Stormer
  • Fox News Insider
  • Gateway Pundit
  • Guardian
  • Huffington Post
  • New York Times
  • Newsmax
  • Patriot Post
  • Project Veritas
  • Ralph Retort
  • Reddit
  • RT
  • The Atlantic
  • The Blaze
  • Wall Street Journal
  • Washington Post
  • White Rabbit Radio
  • YouTube
  • ZeroHedge

This list of sites is obviously not exhaustive, and the work here is just a start, but from this initial review, some interesting patterns begin to emerge. Over the entire list of 25 sites, just under 500 different ad tracking domains are called. Of all of these adtech companies, over 60 are used on 10 or more sites. Amazon Adsystem is used on 12 of the 25 sites; 18 of the 25 sites use adtech supplied by Yahoo, and 23 of the 25 sites use Doubleclick, from Google.
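The cross-site counts in the paragraph above come down to a simple tally: for each third-party domain, count how many of the surveyed sites call it. The sketch below shows one way that tally could be computed. It is not the script used for this post; the function names and the toy data are made up for illustration, and it assumes the domains for each site have already been collected.

```python
from collections import Counter


def sites_per_domain(site_domains):
    """Count how many sites call each third-party domain.

    site_domains maps a site name to the list of domains it called.
    """
    counts = Counter()
    for domains in site_domains.values():
        # set() makes sure a domain counts once per site, no matter
        # how many requests that site made to it.
        counts.update(set(domains))
    return counts


def widely_used(site_domains, threshold=10):
    """Return only the domains called on at least `threshold` sites."""
    counts = sites_per_domain(site_domains)
    return {domain: n for domain, n in counts.items() if n >= threshold}


# Toy data for illustration only - not the dataset from this post.
example = {
    "site-a.example": ["doubleclick.net", "yahoo.com", "doubleclick.net"],
    "site-b.example": ["doubleclick.net", "amazon-adsystem.com"],
    "site-c.example": ["doubleclick.net", "yahoo.com"],
}

print(sites_per_domain(example).most_common())
print(widely_used(example, threshold=2))
```

With the real dataset, a threshold of 10 would produce the list of commonly used services.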

To collect information about the URLs called when visiting each site, we set up an intercepting proxy - for these tests, I used OWASP ZAP, an open source tool. I browsed using Firefox, and set up a custom profile to use while testing. Before visiting each site, I removed all browsing history, cache, and cookies. To test each site, I visited the home page, a story linked from the home page, and a second story or page on the site, for a total of three pages per site. Then, using reporting functionality built into ZAP, I exported all the URLs called while visiting each site.

This is the same list of sites, sorted by the number of different domains called when visiting each site. At the risk of getting overly technical, a "single domain" looks at the base path, so if a site made a call to "api.doubleclick.com" and "ads.doubleclick.com" that counts as a single domain because of the common base of "doubleclick.com":

  • ZeroHedge: 184
  • AllenBWest: 183
  • Daily Caller: 152
  • Gateway Pundit: 142
  • The Blaze: 139
  • Bipartisan Report: 129
  • Huffington Post: 129
  • RT: 121
  • Alternet: 116
  • Breitbart: 87
  • Ralph Retort: 82
  • New York Times: 78
  • Newsmax: 78
  • Addicting Info: 75
  • Washington Post: 71
  • The Atlantic: 68
  • Fox News Insider: 55
  • Wall Street Journal: 49
  • Project Veritas: 20
  • Guardian: 18
  • Reddit: 16
  • White Rabbit Radio: 16
  • Daily Stormer: 12
  • YouTube: 11
  • Patriot Post: 10
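For readers who want to see the "single domain" grouping described above in concrete terms, here is a rough sketch. It assumes a plain list of URLs like the one exported from the proxy, and it takes a shortcut: it treats the last two labels of the hostname as the base domain. That shortcut is an assumption made for this example, and it gets addresses like "example.co.uk" wrong; a more careful version would consult the Public Suffix List. The function names are invented for illustration.

```python
from urllib.parse import urlsplit


def base_domain(url):
    """Reduce a URL to a rough base domain, e.g. 'doubleclick.com'.

    Simplification: keeps only the last two labels of the hostname.
    """
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


def distinct_domains(urls):
    """Return the set of base domains seen in a list of URLs."""
    domains = {base_domain(url) for url in urls}
    domains.discard("")  # drop entries that had no hostname
    return domains


# Example URLs, made up for illustration.
urls = [
    "https://api.doubleclick.com/track?id=1",
    "https://ads.doubleclick.com/pixel.gif",
    "https://www.google-analytics.com/collect",
]

print(distinct_domains(urls))
print(len(distinct_domains(urls)))
```

The two doubleclick.com URLs collapse into one entry, so the three example URLs count as two domains.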

The core dataset used for the analysis in this post is available here. At the end of this post, I also include additional details on the trackers used and their affiliations.

When we drill down into individual sites, we see that Daily Stormer and White Rabbit Radio - two far right sites - make use of Doubleclick advertising - owned by Google - to generate ad revenue.

Daily Stormer:

  • doubleclick.net
  • google.com
  • googleapis.com
  • gstatic.com
  • google-analytics.com
  • facebook.com
  • twitter.com
  • facebook.net
  • fbcdn.net
  • youtube.com
  • twimg.com
  • dailystormer.com

White Rabbit Radio:

  • doubleclick.net
  • google.com
  • googleapis.com
  • gstatic.com
  • google-analytics.com
  • twitter.com
  • youtube.com
  • twimg.com
  • scorecardresearch.com
  • disqus.com
  • newrelic.com
  • nr-data.net
  • statcounter.com
  • mixcloud.com
  • stripe.com
  • whiterabbitradio.net

It's also worth noting that owners of Daily Stormer and White Rabbit Radio use Google Analytics to understand how people interact with their sites. Daily Stormer and White Rabbit Radio have plenty of company here: Addicting Info, AllenBWest.com, Alternet, Breitbart, Daily Caller, Fox News Insider, the New York Times, Reddit, and the Atlantic - among others - also use this common infrastructure that is provided by - and sends data to - Google. When we look at the web through the lens of adtech, one thing becomes abundantly clear: adtech vendors sell indiscriminately, across the political spectrum and social spectrum. It's also clear that without the complicity of adtech vendors, sites on the political fringe - both right and left - would have far fewer resources.

The broad use of adtech to generate revenue creates a level of interconnectedness and dependency between content providers and the adtech networks that profit from them. Ads allow content providers to make money, but that money means different things - for both ad networks and content providers. The overhead of a four person content farm publishing dishonest clickbait, or of a right wing blog like Gateway Pundit, or of a left wing blog like Alternet, or a right wing site like Breitbart, is very different than that of the Wall Street Journal, or the New York Times, or the Washington Post. Yet, all of these sites use (as just one example) comScore, a data aggregator and ad exchange.

When Breitbart attempts to undermine the credibility of the Washington Post and the New York Times, the services offered by comScore make that profitable.

Breitbart

When Gateway Pundit attempts to undermine the credibility of the Washington Post and the New York Times, the services offered by comScore make that profitable.

Gateway Pundit

When the Times and the Post cover news, they also attempt to generate revenue via the services of comScore. But, as these publishers fire back at one another, comScore - like its brothers in adtech - generates consistent revenue as a result of the crossfire where comScore arms both sides.

As we look at the sites people read and share on social media, what should we make of the fact that 10 of the 25 sites surveyed (Addicting Info, AllenBWest, Alternet, Bipartisan Report, Daily Stormer, Gateway Pundit, Huffington Post, Newsmax, Patriot Post, and Project Veritas) load this 18,400 line javascript file supplied by Facebook? Should we believe that all of these sites, from all over the political and social spectrum, require this enormous file to function? Have web developers and site owners become lazy and willing to use prefab components without considering the larger consequences?

The pervasive use of the same adtech provided by the same companies to competing sites and political enemies raises some interesting questions.

  • What does it mean that the same individual companies that specialize in data collection, analysis, and reuse are woven into our news and information systems across the ideological spectrum?
  • What does it mean when the act of reading news, or engaging in political activism online, is an observed activity?
  • What does it mean when legitimate news outlets are reliant on a small number of adtech companies for revenue, and these adtech companies sell to anyone, regardless of whether they traffic in hate, deception, or news?
  • Given the higher expenses of doing news well - with editors, paid writers, professional fact checkers - what obligations, if any, does adtech have to police hate speech or propaganda?

Tracking the Trackers

Tracking the companies that profit from selling ads to all sides is complicated by the fact that adtech is highly opaque. If we attempt to visit the URLs that show up in the proxy logs, we are met with a dizzying array of responses, the vast majority of which are completely uninformative. Getting a company name generally requires some or all of these four steps:

Even with these tools, tying a specific URL to a specific company can be time consuming. As an example, the domain adsrvr.org is called in 17 out of the 25 sites we surveyed. Visiting their home page shows nothing.

A Whois lookup indicates that the domain has been registered anonymously via GoDaddy, so there is no company information publicly available for the domain.

anon registration

However, we can track the IP address of the domain and do a reverse IP address lookup, which indicates three other sites hosted on the same IP address.

Reverse IP Lookup

A search using the domain names turns up this opt-out page, which confirms that The Trade Desk controls adsrvr.org.

This is ridiculously opaque. You should not need to know how to do a reverse IP address lookup or a Whois search to know what company is collecting information on you - and this is just for one tracker loaded on one site. The Huffington Post loads upwards of 120 trackers; the Daily Caller loads upwards of 150; AllenBWest.com loads over 175. If we estimate very low, and assume that we can identify each tracker in 5 minutes, that still comes out to 600 minutes to identify all the trackers used by the Huffington Post, or 875 minutes for AllenBWest.com. The adtech companies that profit from our data and sell access to us as we surf the web use this opacity to further their business interests.
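The time estimate above is simple multiplication, spelled out here as a small sketch. The tracker counts are the rounded-down figures from the paragraph, the 5 minutes per tracker is the deliberately low guess made there, and the function name is invented for illustration.

```python
MINUTES_PER_TRACKER = 5  # the deliberately low estimate used above


def minutes_to_identify(tracker_count, minutes_each=MINUTES_PER_TRACKER):
    """Estimate the minutes needed to identify every tracker on a site."""
    return tracker_count * minutes_each


# Rounded-down tracker counts from the paragraph above.
sites = {"Huffington Post": 120, "Daily Caller": 150, "AllenBWest.com": 175}

for site, trackers in sites.items():
    minutes = minutes_to_identify(trackers)
    print(f"{site}: {minutes} minutes (about {minutes / 60:.1f} hours)")
```

Even at this optimistic rate, identifying the trackers on a single site takes the better part of a working day or more.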

Commonly Used Trackers

As noted earlier, just over 60 trackers/services are used in 10 or more of the 25 sites we surveyed. For this post, I identified the companies involved to get a sense of who some of the bigger players are. To emphasize, the list of 25 sites surveyed is incomplete yet representative. This is the beginning of work that will likely be ongoing, as time allows.

But when we look at the 60 services used most commonly across these 25 different sites, the vast majority of these companies are IAB members. As Randall Rothenberg, the President and CEO of the IAB, observed in his speech:

(W)e face a challenge that has boiled over into crisis, perhaps the greatest crisis it is possible to face. For it is a crisis not of our industry, not of our digital media and marketing village, but a crisis of society writ large.

Right now, the status quo in adtech is to sell to all sides, and profit from both the arms race and the battles. While our discourse and news ecosystem remains mired in misinformation, adtech pulls profit.

Adtech profits when we read lies, and adtech allows liars to earn revenue.

Adtech profits when we read hate speech, and adtech allows the people who spread hate to earn revenue.

Adtech profits when places like the Huffington Post convince writers to publish for "exposure," and adtech allows the Huffington Post to generate revenue for these exploitive practices.

Adtech profits when people read traditional news outlets, and adtech allows these news outlets to generate revenue.

It's worth remembering that the impact of ad revenue will vary based on the overhead within an organization. The more a site cuts corners, eliminates editors and fact checkers, or doesn't pay writers, the greater the benefit of revenue generated via adtech. The benefits of adtech tilt the scales toward falsehood, sensationalism, and hate. Adtech in its current form - predicated on online monitoring of consumers, and selling access to user data via ad exchanges - gives a decided advantage to those who are willing to bypass facts in favor of bias, superficiality, or an emotional appeal.

Appendix 1 - Dataset

Full dataset in csv format: https://gist.github.com/billfitzgerald/5965a6009a9b939f4155cffea2fe8170

Appendix 2 - List of third party services, by URL

  • doubleclick.net - used in 23 sites.
  • google.com - used in 22 sites.
  • googleapis.com - used in 22 sites.
  • gstatic.com - used in 21 sites.
  • google-analytics.com - used in 21 sites.
  • googlesyndication.com - used in 21 sites.
  • scorecardresearch.com - used in 20 sites.
  • facebook.com - used in 19 sites.
  • googletagservices.com - used in 19 sites.
  • adnxs.com - used in 18 sites.
  • demdex.net - used in 18 sites.
  • yahoo.com - used in 18 sites.
  • twitter.com - used in 17 sites.
  • facebook.net - used in 17 sites.
  • adsrvr.org - used in 17 sites.
  • bluekai.com - used in 17 sites.
  • tubemogul.com - used in 17 sites.
  • advertising.com - used in 16 sites.
  • bidswitch.net - used in 16 sites.
  • openx.net - used in 16 sites.
  • tidaltv.com - used in 16 sites.
  • turn.com - used in 16 sites.
  • agkn.com - used in 15 sites.
  • casalemedia.com - used in 15 sites.
  • rubiconproject.com - used in 15 sites.
  • sitescout.com - used in 15 sites.
  • tapad.com - used in 15 sites.
  • 1rx.io - used in 15 sites.
  • 2mdn.net - used in 14 sites.
  • moatads.com - used in 14 sites.
  • nexac.com - used in 14 sites.
  • simpli.fi - used in 14 sites.
  • contextweb.com - used in 14 sites.
  • crwdcntrl.net - used in 14 sites.
  • gwallet.com - used in 14 sites.
  • quantserve.com - used in 14 sites.
  • rfihub.com - used in 14 sites.
  • spotxchange.com - used in 14 sites.
  • cloudfront.net - used in 13 sites.
  • adap.tv - used in 13 sites.
  • addthis.com - used in 13 sites.
  • revsci.net - used in 13 sites.
  • adtechus.com - used in 12 sites.
  • amazon-adsystem.com - used in 12 sites.
  • adsymptotic.com - used in 12 sites.
  • dotomi.com - used in 12 sites.
  • media6degrees.com - used in 12 sites.
  • mxptint.net - used in 12 sites.
  • chango.com - used in 11 sites.
  • eyereturn.com - used in 11 sites.
  • bidr.io - used in 11 sites.
  • eqads.com - used in 11 sites.
  • everesttech.net - used in 11 sites.
  • pubmatic.com - used in 11 sites.
  • youtube.com - used in 10 sites.
  • fbcdn.net - used in 10 sites.
  • adhigh.net - used in 10 sites.
  • cloudflare.com - used in 10 sites.
  • ib-ibi.com - used in 10 sites.
  • mathtag.com - used in 10 sites.
  • basebanner.com - used in 10 sites.
  • eyeviewads.com - used in 10 sites.
  • tribalfusion.com - used in 10 sites.

Appendix 3 - List of most used third party services, with additional details

1rx.io

RhythmOne is an IAB member.

adap.tv

AOL/adap.tv is an IAB member.

addthis.com

AddThis is an IAB member.

adhigh.net

adnxs.com

AppNexus is an IAB member.

adsrvr.org

The Trade Desk is an IAB member.

adsymptotic.com

Drawbridge is an IAB member.

adtechus.com

AOL is an IAB member.

advertising.com

Advertising.com is an IAB member.

agkn.com

Neustar is an IAB member.

amazon-adsystem.com

Amazon is an IAB member.

basebanner.com

ConvertMedia is an IAB member.

bidr.io

bidswitch.net

bluekai.com

IAB and TRUSTe member

casalemedia.com

Index Exchange is an IAB member.

chango.com

Rubicon Project is an IAB member.

cloudfront.net

  • Owned by Amazon
  • Used primarily as a cdn, so its use will vary widely among sites.
  • Not explicitly used for ad networks

cloudflare.com

  • Like Cloudfront, Cloudflare is a CDN

contextweb.com

Pulsepoint is an IAB member.

crwdcntrl.net

Lotame is an IAB member.

demdex.net

Adobe is an IAB member, although their Marketing Cloud appears to not be an IAB member.

dotomi.com

Conversant is an IAB member.

eqads.com

everesttech.net

Adobe is an IAB member.

eyereturn.com

Eyereturn is an IAB member.

eyeviewads.com

Eyeview is an IAB member.

Facebook companies

  • facebook.com
  • facebook.net
  • fbcdn.net

Facebook is an IAB member.

Google companies

The following domains are associated with Google services - some are ad-related, some (like YouTube) both provide a service and tracking. All of these domains - individually - were called on 10 or more of the 25 sites surveyed.

  • doubleclick.net
  • google.com
  • googleapis.com
  • google-analytics.com
  • googlesyndication.com
  • gstatic.com
  • googletagservices.com
  • 2mdn.net (part of Google/Doubleclick)
  • youtube.com

Additional info on these domains/services:

Google is an IAB member.

gwallet.com

RadiumOne is an IAB member.

ib-ibi.com

mathtag.com

MediaMath is an IAB member.

media6degrees.com

Dstillery is an IAB member.

moatads.com

mxptint.net

Maxpoint is an IAB member.

nexac.com

Datalogix is an IAB member.

openx.net

OpenX is an IAB member.

pubmatic.com

Pubmatic is an IAB member.

quantserve.com

Quantcast is an IAB member.

revsci.net

AudienceScience is an IAB member.

rfihub.com

Rocket Fuel is an IAB member.

rubiconproject.com

Rubicon Project is an IAB member.

scorecardresearch.com

comScore is an IAB member.

simpli.fi

Simplifi is an IAB member.

sitescout.com

Sitescout is an IAB member.

spotxchange.com

SpotXchange is an IAB member.

tapad.com

Tapad Inc is an IAB member.

tidaltv.com

Videology is an IAB member.

tribalfusion.com

Exponential is the company name (https://apps.ghostery.com/en/apps/tribal_fusion). Exponential is an IAB member.

tubemogul.com

Tubemogul is an IAB member.

turn.com

Turn is an IAB member.

twitter.com

yahoo.com

Yahoo is owned by Verizon, and is an IAB member.

Google, Lawsuits, and the Importance of Good Documentation

8 min read

This week, the Mississippi Attorney General sued Google, claiming that Google is mining student data. In this post, I'll share some general, personal thoughts, and some recommendations for Google.

To start, it's worth watching a statement from the press conference where the suit was announced - this video clip was shared by Anna Wolfe, a journalist who covered the event.

At 1:46 in the video, the AG describes the "tests" that were run. To be blunt, these tests don't sound like actual tests - it sounds more like browsing and looking at the screen. Unless the student account they were using was relatively new, had never done any searches on the topic being "tested," had never browsed while logged in to any non-Google site that had ad tracking, and all testing browsers had their cache, cookies, and browsing history cleared, there are a range of benign options that could explain behavior that looks like targeted ads. And that doesn't even take into account the difference between targeted ads based on past behavior, and content-based ads delivered because a page describes a specific subject.

Without additional detail from the Mississippi AG on how they tested for tracking, the current claims of tracking are less than persuasive.

G Suite Terms, and (a Lack of) Clarity

An area where Google can improve is highlighted in the suit: Google's terms, and the way Google describes how educational data are handled, are not easily accessible or comprehensible (all the necessary disclaimers apply: I am not a lawyer, this is not legal advice, etc, etc). This commentary is limited to transparency and clarity. With that said, Google could blunt a lot of the claims and criticisms they receive with better documentation. The people who are doing this work at Google are smart and talented - they should be allowed to describe the details of their work more effectively.

Google has built a "Trust" page for G Suite, formerly known as Google Apps for Education. The opening paragraphs of text on this page highlight the confusing complexity of Google's terms.

Opening text from Trust page

In this opening text, Google links to five different policies that govern use of Google products in education:

However, this list of five different legal documents leaves out five additional documents that potentially govern use of G Suite in Education:

Of these five additional documents, two (the Data Processing Amendment and the Model Contract Clauses) are optional. However, these ten documents are not listed together in a single, coherent list anywhere on the Google site that I have found. The trust page also links to this list of Google services that are not included in G Suite/Google Apps for Education, but that can be enabled within G Suite. The list includes over 40 individual services, which are all covered by different sets of terms.

Moving down the "Trust" page, we see several different words or phrases used to refer to the Education Terms: "contracts," "G Suite Agreement," and "agreements." These all link to the same document, but the different names for the same document make it more difficult to follow than it needs to be.

Some simple things Google could do on the "Trust" page:

  • list out all applicable terms and policies, with a simple description of what is covered;
  • list out the order of precedence among the different documents that govern G Suite use. If there is a contradiction between any of these documents, identify which document is authoritative. As just one example, the Data Processing Agreement and the G Suite Agreement define key terms like "affiliate" in slightly different ways;
  • highlight what documents are optional;
  • create a simple template for districts (or state departments of ed, or universities) to document the agreements governing a particular G Suite/Google Apps implementation;
  • standardize language used when referring to different policies;
  • define the differences between the Education-specific contracts and the Consumer contracts;
  • in each of their legal terms, create IDs that allow for linking directly to a section of a document.

While the above steps would be an improvement, creating standalone, education-specific terms that were fully independent of the consumer terms would add additional clarity. From a product development standpoint, this legal review would force an internal review to ensure that legal terms and technical implementation were in sync. To be clear, this is an enormous undertaking, but if Google did this, it would add some much-needed clarity. Practically speaking, Google could use this step to generate some solid PR as well. The PR messaging on this practically writes itself: "Google has always prided itself on being a leader in security, data privacy, and transparency. As our products evolve and improve, we are always making sure that our agreements evolve and improve as well."

G Suite and Advertising

Google has stated on multiple occasions that "There are no ads in the suite of G Suite core services." Here, it's worth noting that "core services" for education only includes Gmail, Google Calendar, Google Talk, Google Hangouts, Google Drive, Google Docs, Google Sheets, Google Slides, Google Forms, Google Sites, Google Contacts, and Google Vault. Other services - like Maps, Blogger, YouTube, History, and Custom Search - are not part of the core services, and are not covered under educational terms.

Ads text from Trust page

There are differences, however, between showing ads, targeting ads, and collecting data for use in profiles. Ads can be shown on the basis of the content of the page (ie, read an article about canoeing, see an ad for canoes), and this requires no information about the person reading the page.

Targeted ads use information collected from or about a user to target them, or their general demographic, with specific ads. However, while targeted ads are annoying and intrusive, they provide visual evidence that personal data is being collected and organized into a profile.

On their "Trust" page, as pictured above, Google states that "Google does not use any user personal information (or any information associated with a Google Account) to target ads."

In Google's Educational Terms, they state that they collect the following information from users of their educational services:

  • device information, such as the hardware model, operating system version, unique device identifiers, and mobile network information including phone number of the user;
  • log information, including details of how a user used our service, device event information, and the user's Internet protocol (IP) address;
  • location information, as determined by various technologies including IP address, GPS, and other sensors.

While it is great that Google states that they don't use information collected from educational users to target ads, Google also needs to provide a technical explanation that demonstrates how they ensure that IP addresses collected from students, unique IDs that are tied to student devices, and student phone numbers are explicitly excluded from advertising activity. Also, Google should clearly define what they mean when they say "advertising purposes", as this phrase is vague enough to take on many different meanings, often showing more about the opinions of the reader than the practice of Google.

This technical explanation should also include how the prohibitions against advertising based on data collected in Google Apps can square with this definition of advertising pulled from the optional Data Processing Agreement:

"'Advertising' means online advertisements displayed by Google to End Users, excluding any advertisements Customer expressly chooses to have Google or any Google Affiliate display in connection with the Services under a separate agreement (for example, Google AdSense advertisements implemented by Customer on a website created by Customer using the "Google Sites" functionality within the Services)."

There are many ways that all of these statements can be true simultaneously, but without a technically sound explanation of how this is accomplished, Google is essentially asking people to trust them with no demonstration of how this is possible.

Conclusion

Google has been working in the educational space for years, and they have put a lot of thought into their products. However, real questions still exist about how these products work, and about how data collected from kids in these products is handled. Google has created copious documentation, but - ironically - that is part of the problem, as the sheer volume of what they have created contains contradictions and repetitions with slight degrees of variance that impede understanding. Based both on seeing Google's terms evolve over the years and on seeing terms in multiple other products, these issues actually feel pretty normal. This doesn't mean that they don't need to be addressed, but I don't see malice in any of these shortcomings.

However, the concern is real, for Google and other EdTech companies: if your product supports learning today, it shouldn't support redlining and profiling tomorrow.