How Smooth is Your Echo Chamber? Final Documentation

Done in collaboration with Marijke Jorritsma (documentation was written by both of us)

Overview of the Project

“How Smooth Is Your Echo Chamber?” is a data visualization project that aims to personalize the hypothesis that social media networks create “echo chambers” of one’s own belief system.  

Based on an analysis of the locations of the accounts a user follows on Twitter, a low-poly version of their profile picture is returned, visualizing the geographic diversity of their feed. The amount of detail, or recognizability, of the low-poly profile picture is determined by how homophilic their feed is: a more homophilic feed returns a smoother, less detailed image, and a less homophilic (or more heterophilic) feed returns a more detailed, less smooth image.

The resulting image serves as a visual metaphor for how dynamic one’s worldview and self-image are, based on the variation of one’s sources.

Background Research

Homophily, which is often described by the saying “birds of a feather flock together,” is the tendency for individuals to seek out and associate with others who share similar attributes (Zamal, Liu, and Ruths 2012). Homophily has been a subject of network study since the beginning of the twentieth century, when demographic characteristics such as age, sex, race/ethnicity, and education (e.g., Bott 1929, Loomis 1946) and psychological characteristics such as intelligence, attitudes, and aspirations (e.g., Almanack 1922, Richardson 1940) were found to be determinants in the creation of quickly forming small social networks (McPherson, Smith-Lovin, and Cook 2001).


Homophily has attracted renewed interest as social media networks have made it possible to collect data and track individuals’ self-curated social influences with more accuracy. Additionally, a trend towards finding news through social media channels rather than going directly to sources has led to research on the effects of social media in creating “echo chambers” of one’s political and social beliefs (Miller 2014).


Creating the Visualization - Goals and Interaction Model

The goal of our final was to distort an image of the user into a low poly version based on the diversity of locations of the people they follow on Twitter. The more diverse a person’s information sources are, the clearer the resulting image is, reflecting their understanding of the world around them.

The Twitter API refers to the people that a person follows as that person’s ‘friends’, so I will be using that term as shorthand moving forward.

The current interaction model for the project works like this:

  1. Users are asked directly by either Marijke or me if we have permission to use their Twitter data and a picture of them.

    1. If they say yes, we will manually store their results to a text file.

    2. If they say no, we won’t store their results.

  2. Users enter their Twitter handle into a Node.js program.

  3. The program outputs all of the locations that the user’s friends have reported.

At this point, the user’s work is done. The data generated is then worked on manually by Marijke and me.

  1. The location data is analyzed to account for fake locations and repeated data.

  2. The locations are then run through a second program that generates an “echo score” based on the standard deviation of the data points.

  3. The echo score determines the number of points used to generate the low poly image.

  4. The image is generated and returned to the user.

Creating the Visualization - Data

When we first started, we were analyzing the descriptions that people were writing about themselves to see if we could find trends in the words and what we could figure out about the user based on those words. However, we ended up shifting to analyzing locations due to some of the research we found that postulated a relationship between homophilous networks and location.

The data being used comes from a person’s Twitter account via the Twitter API. I made my API calls using Node.js - I’d previously made a Twitter bot using Node and Express, so I had some familiarity with how to do this. Unfortunately, in choosing to use Node I wasn't able to create a webpage interface. In order to get this data, I take the user’s Twitter handle and do the following:

  1. The first API call is done to get the list of friends. This is returned as an array of ID numbers for each friend.

  2. In order to access each friend’s data, I have to pass in their IDs to get their full Twitter data. I do this in chunks of 100 IDs per API call, which is the maximum allowed by Twitter.

  3. I store each of the friends in an array as an object containing their ID, location, and description.
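The batching in steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: `chunk` and `collectFriends` are names made up for this example, and `lookupUsers` is a hypothetical stand-in for whatever function wraps the Twitter user-lookup call (it is assumed to take an array of IDs and resolve to an array of user objects).

```javascript
// Split an array of friend IDs into batches of at most `size` IDs.
function chunk(ids, size = 100) {
  const batches = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// `lookupUsers` is a hypothetical wrapper around the user-lookup API call.
// It is assumed to accept up to 100 IDs and resolve to an array of users.
async function collectFriends(ids, lookupUsers) {
  const friends = [];
  for (const batch of chunk(ids)) {
    const users = await lookupUsers(batch);
    for (const user of users) {
      // Keep only the three fields described in step 3.
      friends.push({
        id: user.id,
        location: user.location || '',
        description: user.description || '',
      });
    }
  }
  return friends;
}
```

With a little over 400 friends, this would make five lookup calls: four full batches of 100 and one smaller batch for the remainder.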

Sample of data (with permission from Daria Bojko)

After I’ve collected all of the friends’ location and description data, I use a regular expression to split each one into individual words. This can sometimes be buggy - for example, if the description is in a language that doesn’t use Roman letters (Korean, Chinese, Hindi, etc.) then the regular expression is unable to parse it into words.
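One possible way around the non-Roman-letter problem, which we did not try, is a Unicode-aware regular expression. The sketch below assumes a Node.js version that supports Unicode property escapes (`\p{...}` with the `u` flag); `tokenize` is a name invented for this example.

```javascript
// Match runs of letters (\p{L}), combining marks (\p{M}) and digits (\p{N})
// in any script, instead of only a-z. Marks are included so that scripts
// with vowel signs, such as Devanagari, are not split mid-word.
function tokenize(text) {
  const matches = text.toLowerCase().match(/[\p{L}\p{M}\p{N}]+/gu);
  return matches || []; // match() returns null when nothing is found
}

console.log(tokenize('New York, NY'));
console.log(tokenize('서울, 대한민국'));
```

This only helps for languages that put spaces between words. Chinese text, which is written without spaces, would still come back as one long token and would need a real word segmenter.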

I then count the number of times each word appears on its own. This can work well for descriptions but not for locations; if a location is two words, then analyzing each individual word becomes meaningless. For example, New York would be divided into two words, and the number of times New appears is different from the number of times York appears. Thus, some of the context is lost. So for locations, I generated pairs of adjacent words in order to retain some of that context. I stored these in a concordance in order to track unique entries - if an entry was already in the concordance, its count increased by one; if not, it was added.
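The word-pair concordance can be sketched like this. It is a simplified reconstruction rather than the original code: the function names are invented for the example, the splitting regex is an assumption, and a `Map` is used for the concordance so that words like "constructor" cannot collide with built-in object properties.

```javascript
// Lowercase the text and pull out runs of Roman letters and digits.
function splitWords(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Turn "San Francisco, CA" into ["san francisco", "francisco ca"].
function wordPairs(text) {
  const words = splitWords(text);
  const pairs = [];
  for (let i = 0; i < words.length - 1; i++) {
    pairs.push(words[i] + ' ' + words[i + 1]);
  }
  return pairs;
}

// Add each key to the concordance, or bump its count if already present.
function addToConcordance(concordance, keys) {
  for (const key of keys) {
    concordance.set(key, (concordance.get(key) || 0) + 1);
  }
  return concordance;
}

const concordance = new Map();
for (const location of ['New York', 'New York, NY', 'San Francisco, CA']) {
  addToConcordance(concordance, wordPairs(location));
}
console.log(concordance);
```

Two things to note about this sketch: a three-word location contributes two overlapping pairs, and a one-word location such as "Chicago" produces no pairs at all, so it would need to be counted separately.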

There are a lot of different ways to describe the same location. New York City, NYC, NY NY, and New York New York all refer to the same place. I couldn’t think of an automated way to reconcile these, so we had to manually sort through the locations to decipher what they meant so we could correctly count the number of people who were from a place. Even this led to some problems - New York could mean New York the state or NYC. The code needs some tweaks in order to really differentiate between the two, but this was a rare case for the most part. We also had to account for fictional or joke locations (“everywhere and nowhere” is an example we found frequently). We treated these as not reported.
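We did this cleanup by hand, but a first step toward automating it might be a hand-maintained alias table. The sketch below is hypothetical: the table entries, the joke-location list, and the `normalizeLocation` name are all illustrative assumptions, and the table would need to grow as new spellings turn up.

```javascript
// Illustrative alias table mapping known spellings to one canonical name.
const ALIASES = new Map([
  ['new york city', 'new york city'],
  ['nyc', 'new york city'],
  ['ny ny', 'new york city'],
  ['new york new york', 'new york city'],
]);

// Blank, fictional, or joke locations are treated as not provided.
const NOT_PROVIDED = new Set(['', 'everywhere', 'nowhere', 'everywhere and nowhere']);

function normalizeLocation(raw) {
  // Lowercase, turn punctuation into spaces, and trim the ends.
  const key = raw.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
  if (NOT_PROVIDED.has(key)) return 'not provided';
  // Unknown spellings pass through unchanged for manual review.
  return ALIASES.get(key) || key;
}

console.log(normalizeLocation('NY, NY'));
console.log(normalizeLocation('Everywhere and nowhere'));
```

An ambiguous entry like plain "New York" is deliberately left out of the table, since it could mean the state or the city and still needs a human decision.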

Sample of cleaned data

Once the data was cleaned, we ran it through a program Marijke wrote that calculated the total number of locations, the mean, the variance, the standard deviation, and an “echo score,” defined as (1 - standard deviation) * data set length. We had help with this from Kevin Irlen. The echo score gave us the number of points to use in generating the low poly image, which Marijke created with a program that takes the number of points as input.
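As a rough sketch of that calculation: the formula (1 - standard deviation) * data set length is from our program, but the details below are assumptions made for illustration. In particular, this version assumes the standard deviation is taken over each location's share of friends (count divided by total) and uses the population form of the variance. `echoStats` is a made-up name, not the name of Marijke's program.

```javascript
// `counts` holds the number of friends at each unique cleaned location.
function echoStats(counts) {
  if (counts.length === 0) throw new Error('no locations to analyze');
  const n = counts.length; // data set length
  const total = counts.reduce((sum, c) => sum + c, 0);
  // Assumption: work with each location's share of all friends.
  const shares = counts.map((c) => c / total);
  const mean = shares.reduce((sum, x) => sum + x, 0) / n;
  // Assumption: population variance (divide by n, not n - 1).
  const variance = shares.reduce((sum, x) => sum + (x - mean) ** 2, 0) / n;
  const standardDeviation = Math.sqrt(variance);
  const echoScore = (1 - standardDeviation) * n;
  // Round down to get a whole number of points for the low poly image.
  return { n, mean, variance, standardDeviation, echoScore, points: Math.floor(echoScore) };
}

console.log(echoStats([3, 1]));
```

Under these assumptions, friends spread perfectly evenly across locations give a standard deviation of 0 and a score equal to the number of locations, while a feed dominated by a few places pulls the score down.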

Running data through analysis program

Case Study - My Data

At the time I ran this test, I was following a little over 400 people. Many of them are in fields that I am particularly interested in: user experience, museums, comics, and games, while some are people I know personally. I grew up in the suburbs of Chicago, but when I joined Twitter I was interning in Silicon Valley.

Here are my top five results:

1: not provided - 98 people

2: new york - 51 people

3: san francisco - 41 people

4: los angeles - 25 people

5: francisco ca - 23 people

Not Provided being the top result doesn’t surprise me, since Twitter doesn’t require location reporting. The three most common locations are New York, San Francisco, and LA. ‘Francisco CA’ duplicates San Francisco (a side effect of counting word pairs), so it doesn’t count. I was only a little surprised by LA since I have never lived there, but then I realized that that’s where a lot of television series accounts would be located.

My stats are in the screenshot above - I had 74 total locations after cleaning the data, with a very low standard deviation indicating that most of my data was clustered together. I ended up with an echo score of 71.14, which was rounded down to 71 points.

Marijke took my Facebook and Twitter profile pictures and made low poly versions with 71 points for each location. I thought that was pretty good until I saw what that actually translated to in modifying my images.

It’s not as diverse as I thought it would be, although I do have a relatively small pool of topics that I use Twitter to learn about. Still, my face is kind of discernible in both pictures, and when I posted the result to Facebook I got some interesting comments about it. People who were unfamiliar with my project found the image unsettling.

In terms of interactivity, at this stage we have a prototype in which people can hover over the points in the image to see the unique locations that their friends are from. The link below is a sample, using my results, of how we picture people seeing the location data alongside the image. Unfortunately, the program Marijke used didn't seem to allow for including the number of friends in addition to the location, but that data could be made available in later iterations depending on how we build it.

Link to Interactive Image

Link to Interactive Graph of where my Twitter friends are from

Next Steps

In an ideal world, everything that Marijke and I do manually would be automated - the location analysis would require some natural language processing and machine learning to find out what words map to which location and what is not actually a location.

We would also like to analyze some of the additional factors we found in the research that lead to more homophilous networks, and the best source of this data would be Facebook rather than Twitter. While both rely on self-reporting, Facebook has data inputs for these factors, whereas Twitter only provides a limited description and an optional input for location. While Twitter’s description text is an interesting resource, it’s put through the filter of the person writing it. Data on gender, education, etc. from Facebook, however, is less nebulous.

So the interaction model would look like this:

  1. The user logs into an application (a Facebook app, a mobile app, or some other form factor to be determined) with their Facebook account.

  2. Data about the user’s friends is scraped to account for location, race/ethnicity, age (from birthday), education, occupation, and gender.

  3. The user’s profile picture would be copied and put through the program to be edited into the low poly format.

  4. The user would get their picture along with a breakdown of their data explaining what it all means and how it affects the way they learn about the world around them.

We simplified this stage of the project to analyze just location in order to understand how to calculate the echo score on a factor that was more quantitative and would not introduce bias (although we did introduce some in the manual sorting of the data), but we hope to scale that up to include the other factors we didn’t get to in order to create a more accurate picture. We think that parsing and sorting the data is a machine learning problem, and the calculations would definitely need more statistical analysis than either of us is familiar with.
 

As for where this project is going in the future, funnily enough, on the day we’re presenting this to the class Marijke found someone who did a project about estimating ideological positions based on Twitter data. He wrote an R file that analyzes political alignment based on politicians followed on Twitter, which is a pretty similar research question to the one we asked in developing this project. We are also applying to the Data Science Incubator at NYU’s Data Science Center, since there are a lot of engineering aspects we need help with that we think the incubator would be best suited for.


The biggest takeaway was seeing how small our exposure to the world is even when we think we’re well informed. By no means will I start reading conservative news sources regularly, but seeing my own features blurred by the program almost to the point of being unrecognizable was a big surprise. I’m curious to see how other factors will affect my results in the future. Although Twitter isn’t the data source we want in the future, the words we found in the self descriptions and the locations revealed a lot about the people we follow. Somewhere in those words, we were able to find the pieces of ourselves too.

Sources

Social Media Deepens Partisan Divides. But Not Always

http://www.nytimes.com/2014/11/21/upshot/social-media-deepens-partisan-divides-but-not-always.html?_r=1

 

Birds of a Feather

http://tuvalu.santafe.edu/~aaronc/courses/5352/readings/McPherson_Smith-Lovin_Cook_01_BirdsOfAFeature_HomophilyInSocialNetworks.pdf

 

Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data

http://pan.oxfordjournals.org/content/23/1/76.full.pdf?keytype=ref&ijkey=uMFPw4dsMHM7608

 

Homophily and Latent Attribute Inference

https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/viewFile/4713/5013

 

How Diverse is Facebook?

https://m.facebook.com/notes/facebook-data-science/how-diverse-is-facebook/205925658858

 

Using Lists to Measure Homophily on Twitter

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.302.4452&rep=rep1&type=pdf

 

A Robust Gender Inference Model

http://firstmonday.org/ojs/index.php/fm/article/view/5216/4113

 

Preference, Homophily, and Social Learning

http://pages.stern.nyu.edu/~ilobel/PreferencesHomophily.pdf

 

Dan Shiffman, Word Counting

http://shiffman.github.io/A2Z-F15/week5/notes.html

 

Why Buzzfeed is Trying to Shift Its Strategy

http://mobile.nytimes.com/2014/08/13/upshot/why-buzzfeed-is-trying-to-shift-its-strategy.html?referrer=