Programming A to Z

How Smooth is Your Echo Chamber? Final Documentation

Done in collaboration with Marijke Jorritsma (documentation was written by both of us)

Overview of the Project

“How Smooth Is Your Echo Chamber?” is a data visualization project that aims to personalize the hypothesis that social media networks create “echo chambers” of one’s own belief system.  

Based on an analysis of the locations of the accounts a user follows on Twitter, a low-poly version of their profile picture is returned, visualizing the geographic diversity of their feed. The amount of detail or recognizability of the low-poly profile picture is determined by how homophilic their feed is: a more homophilic feed returns a less detailed, smoother image, while a less homophilic (or more heterophilic) feed returns a more detailed, less smooth image.

The resulting image serves as a visual metaphor for how dynamic one’s worldview and self-image are, based on the variety of their sources.

Background Research

Homophily, which is often described by the saying “birds of a feather flock together,” describes the tendency for individuals to seek out and associate with others who share similar attributes (Zamal, Liu, and Ruths 2012). Homophily has been a point of network study since the beginning of the twentieth century, when demographic characteristics such as age, sex, race/ethnicity, and education (e.g., Bott 1929, Loomis 1946) and psychological characteristics such as intelligence, attitudes, and aspirations (e.g., Almanack 1922, Richardson 1940) were identified as determinants in how small social networks quickly form (McPherson, Smith-Lovin, and Cook 2001).


Homophily has taken on renewed interest as social media networks have made it possible to collect data and track individuals’ self-curated social influences with more accuracy. Additionally, a trend toward finding news through social media channels rather than going directly to sources has led to research on the effects of social media in creating “echo chambers” of one’s political and social beliefs (Miller, 2014).


Creating the Visualization - Goals and Interaction Model

The goal of our final was to distort an image of the user into a low poly version based on the diversity of locations of the people they follow on Twitter. The more diverse a person’s information sources are, the clearer the resulting image is, reflecting their understanding of the world around them.

The Twitter API refers to the people that a person follows as that person’s ‘friends’, so I will be using that term as shorthand moving forward.

The current interaction model for the project works like this:

  1. Users are asked directly by either Marijke or me whether we have permission to use their Twitter data and a picture of them.

    1. If they say yes, we will manually store their results to a text file.

    2. If they say no, we don’t proceed.

  2. Users enter their Twitter handle into a Node.js program.

  3. The program outputs all of the locations that the user’s friends have reported.

At this point, the user’s work is done. The data generated is then worked on manually by Marijke and me.

  1. The location data is analyzed to account for fake locations and repeated data.

  2. The locations are then run through a second program that generates an “echo score” based on the standard deviation of the data points.

  3. The echo score determines the number of points used to generate the low poly image.

  4. The image is generated and returned to the user.

Creating the Visualization - Data

When we first started, we were analyzing the descriptions that people were writing about themselves to see if we could find trends in the words and what we could figure out about the user based on those words. However, we ended up shifting to analyzing locations due to some of the research we found that postulated a relationship between homophilous networks and location.

The data being used comes from a person’s Twitter account via the Twitter API. I made my API calls using Node.js - I’d previously made a Twitter bot using Node and Express, so I had some familiarity with how to do this. Unfortunately, in choosing to use Node I wasn't able to create a webpage interface. In order to get this data, I take the user’s Twitter handle and do the following (a rough sketch of this step comes after the list):

  1. The first API call is done to get the list of friends. This is returned as an array of ID numbers for each friend.

  2. In order to access the friends’ data, I have to pass their IDs back in to get their full Twitter profiles. I do this in chunks of 100 IDs per API call, which is the maximum allowed by Twitter.

  3. I store each friend in an array as an object containing their ID, location, and description.
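Here’s a rough sketch of what that looks like in Node, assuming the twit npm package for the API calls and a config.js file holding the four Twitter keys (the names here are illustrative, not the exact code we ran; the real program also writes the results to a text file):

var Twit = require('twit');
var T = new Twit(require('./config.js'));   // hypothetical config exporting the four Twitter API keys

function getFriendData(handle, done) {
    // Step 1: get the IDs of everyone the user follows
    T.get('friends/ids', { screen_name: handle, stringify_ids: true }, function (err, data) {
        if (err) { return done(err); }

        var ids = data.ids;
        var friends = [];

        // Step 2: look up full profiles in chunks of 100, Twitter's per-call maximum
        function nextChunk() {
            if (ids.length === 0) { return done(null, friends); }

            var chunk = ids.splice(0, 100);
            T.get('users/lookup', { user_id: chunk.join(',') }, function (err, users) {
                if (err) { return done(err); }

                // Step 3: keep only the fields we care about
                users.forEach(function (u) {
                    friends.push({ id: u.id_str, location: u.location, description: u.description });
                });
                nextChunk();
            });
        }
        nextChunk();
    });
}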

Sample of data (with permission from Daria Bojko)

After I’ve collected all the friends’ location and description data, I use a regular expression to split them into individual words. This can sometimes be buggy - for example, if the description is in a language that doesn’t use Roman letters (Korean, Chinese, Hindi, etc.), the regular expression is unable to parse it into words.

I then count the number of times each word appears independently. This works well for descriptions but not for locations; if a location is two words, analyzing each individual word becomes meaningless. For example, New York would be divided into two words, and the number of times New appears is different from the number of times York appears, so some of the context is lost. For locations, I therefore generated pairs of words in order to retain some of that context. I stored these in a concordance in order to track unique entries - if a word (or pair) was already in the concordance, its count increased by one; if not, it was added.
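A small sketch of the word-pair concordance for locations (assuming the same \W+ style split described above; not verbatim from our code):

function locationConcordance(locations) {
    var concordance = {};

    locations.forEach(function (loc) {
        // lowercase and split on non-word characters, dropping empty strings
        var words = loc.toLowerCase().split(/\W+/).filter(function (w) { return w.length > 0; });

        if (words.length === 1) {
            // single-word locations (e.g. "chicago") are counted as-is
            concordance[words[0]] = (concordance[words[0]] || 0) + 1;
        }

        for (var i = 0; i < words.length - 1; i++) {
            var pair = words[i] + ' ' + words[i + 1];   // keep "new york" together
            concordance[pair] = (concordance[pair] || 0) + 1;
        }
    });

    return concordance;
}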

There are a lot of different ways to describe the same location. New York City, NYC, NY NY, and New York New York all refer to the same place. I couldn’t think of an automated way to handle this, so we had to manually sort through the locations to decipher what they meant in order to correctly count the number of people from each place. Even this led to some problems - New York could mean New York the state or NYC. The code would need some tweaks to really differentiate between the two, but this was a rare case for the most part. We also had to account for fictional or joke locations (“everywhere and nowhere” was one we found frequently). We treated these as not reported.
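This isn’t the automated solution, but a stopgap for folding the variants we kept seeing into one canonical name could look something like this (the alias table is hand-built from the examples above):

var aliases = {
    'nyc': 'new york city',
    'ny ny': 'new york city',
    'new york new york': 'new york city',
    'francisco ca': 'san francisco',
    'everywhere and nowhere': 'not reported'    // joke locations get treated as not reported
};

function canonicalLocation(raw) {
    var key = raw.toLowerCase().trim();
    return aliases[key] || key;    // fall back to the cleaned-up original string
}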

Sample of cleaned data

Once the data was cleaned, we ran it through a program Marijke wrote that calculated the total number of locations, the mean, variance, standard deviation, and the “echo score,” which we defined as (1 - standard deviation) * data set length. This was done with help from Kevin Irlen. The echo score gave us the number of points to use in generating the low poly image, which Marijke then created using a program that takes the number of points as input.
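A minimal sketch of that calculation, assuming the standard deviation is taken over the per-location counts normalized to proportions (the exact normalization in Marijke’s program may differ):

function echoScore(counts) {                      // counts: number of friends per unique location
    var n = counts.length;
    var total = counts.reduce(function (a, b) { return a + b; }, 0);
    var proportions = counts.map(function (c) { return c / total; });

    var mean = 1 / n;                             // proportions always sum to 1
    var variance = proportions.reduce(function (sum, p) {
        return sum + (p - mean) * (p - mean);
    }, 0) / n;
    var stdDev = Math.sqrt(variance);

    return (1 - stdDev) * n;                      // echo score = number of points for the low poly image
}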

Running data through analysis program

Case Study - My Data

At the time I ran this test, I was following a little over 400 people. Many of them are in fields that I am particularly interested in: user experience, museums, comics, and games, while some are people I know personally. I grew up in the suburbs of Chicago, but when I joined Twitter I was interning in Silicon Valley.

Here are my top five results:

1: not provided - 98 people

2: new york - 51 people

3: san francisco - 41 people

4: los angeles - 25 people

5: francisco ca - 23 people

Not Provided being the top result doesn’t surprise me since Twitter doesn’t require location reporting. The three most common locations are New York, San Francisco and LA. ‘Francisco CA’ is a duplication of SF so it won’t count. I was only a little surprised by LA since I have never lived there, but then I realized that that’s where a lot of television series accounts would be located.

My stats are in the screenshot above - I had 74 total locations after cleaning the data, with a very low standard deviation indicating that most of my data was clustered together. I ended up with an echo score of 71.14, which was rounded down to 71 points.

Marijke took my Facebook and Twitter profile pictures and made a low poly version of each with 71 points. I thought that was pretty good until I saw what that actually translated to in modifying my images.

It’s not as diverse as I thought it would be, although I do have a relatively small pool of topics that I use Twitter to learn about. Still, my face is kind of discernible in both pictures, and when I posted the result to Facebook I got some interesting comments about it. People who were unfamiliar with my project found the image unsettling.

In terms of interactivity, at this stage we can show people where their friends are from using a prototype in which hovering over the points in the image reveals the unique locations their friends reported. The link below is a sample, using my results, of how we picture people seeing the location data in addition to the image - to see a location, hover over a point. Unfortunately the program Marijke used didn't seem to allow for including the count alongside the location, but that data could be made available in later iterations depending on how we build it.

Link to Interactive Image

Link to Interactive Graph of where my Twitter friends are from

Next Steps

In an ideal world, everything that Marijke and I do manually would be automated - the location analysis would require some natural language processing and machine learning to figure out which words map to which location and which ones are not actually locations.

We would also like to analyze some of the additional factors we found in the research that lead to more homophilous networks, and the best source of this data would be Facebook rather than Twitter. While both rely on self-reporting, Facebook has data inputs for these factors, whereas Twitter only provides a limited description and an optional input for location. While Twitter’s description text is an interesting resource, it’s put through the filter of the person writing it. Data on gender, education, and so on from Facebook, however, is less nebulous.

So the interaction model would look like this:

  1. The user logs in with Facebook to an application - a Facebook app, a mobile app, or some other form factor to be determined

  2. Data about the user’s friends is scraped to account for location, race/ethnicity, age (from birthday), education, occupation, and gender.

  3. The user’s profile picture would be copied and put through the program to be edited into the low poly format.

  4. The user would get their picture along with a breakdown of their data explaining what it all means and how it affects the way they learn about the world around them.

We simplified this stage of the project to analyze just location in order to understand how to calculate the echo score on a factor that is more quantitative and less likely to introduce bias (although we did introduce some in the manual sorting of the data). We hope to scale that up to include the six factors we didn’t get to, in order to create a more accurate picture. We think parsing and sorting that data is a machine learning problem, and the calculations would definitely need more statistical analysis than either of us is familiar with.
 

As for where this project is going in the future, funnily enough, on the day we presented this to the class Marijke found someone who had done a project estimating ideological positions from Twitter data - he wrote an R script that analyzes political alignment based on which politicians a user follows on Twitter, which is a pretty similar research question to the one we asked in developing this project. We are also applying to the Data Science Incubator at NYU’s Data Science Center - there are a lot of engineering aspects that we need help with, and we think the incubator would be best suited to helping us with them.


The biggest takeaway was seeing how small our exposure to the world is even when we think we’re well informed. By no means will I start reading conservative news sources regularly, but seeing my own features blurred by the program almost to the point of being unrecognizable was a big surprise. I’m curious to see how other factors will affect my results in the future. Although Twitter isn’t the data source we want in the future, the words we found in the self descriptions and the locations revealed a lot about the people we follow. Somewhere in those words, we were able to find the pieces of ourselves too.

Sources

Social Media Deepens Partisan Divides. But Not Always

http://www.nytimes.com/2014/11/21/upshot/social-media-deepens-partisan-divides-but-not-always.html?_r=1

 

Birds of a Feather

http://tuvalu.santafe.edu/~aaronc/courses/5352/readings/McPherson_Smith-Lovin_Cook_01_BirdsOfAFeature_HomophilyInSocialNetworks.pdf

 

Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data

http://pan.oxfordjournals.org/content/23/1/76.full.pdf?keytype=ref&ijkey=uMFPw4dsMHM7608

 

Homophily and Latent Attribute Inference

https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/viewFile/4713/5013

 

How Diverse is Facebook?

https://m.facebook.com/notes/facebook-data-science/how-diverse-is-facebook/205925658858

 

Using Lists to Measure Homophily on Twitter

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.302.4452&rep=rep1&type=pdf

 

A Robust Gender Inference Model

http://firstmonday.org/ojs/index.php/fm/article/view/5216/4113

 

Preference, Homophily, and Social Learning

http://pages.stern.nyu.edu/~ilobel/PreferencesHomophily.pdf

 

Dan Shiffman, Word Counting

http://shiffman.github.io/A2Z-F15/week5/notes.html

 

Why Buzzfeed Is Trying to Shift Its Strategy

http://mobile.nytimes.com/2014/08/13/upshot/why-buzzfeed-is-trying-to-shift-its-strategy.html?referrer=

Final Project Pitch

I have two ideas, both of which I really love, but one of them I think would be really interesting in terms of text analysis.

The first is to keep building on my "But No Black Widow Movie" Twitter bot. I would like to increase the complexity of generated tweets; currently the tweets follow the format of "A <male Marvel character> <movie type> announced for Phase <number>, #butnoblackwidowmovie?". What I'm thinking about adding is: 

A <Marvel character> sequel featuring <Marvel character> 

<Marvel character or group> teams up with <Marvel character or group> 

And any other variants thereof. I also wanted to utilize the WTF Engine to do some fun text generation. 

My other idea that I'm thinking about more strongly is in collaboration with someone in my data visualization class; we were interested in the diversity of viewpoints in the content people consume on Twitter - does social media create an echo chamber of homogenous opinion/perspective? We do this by creating a low polygon image of a person that's colored based on the variety of topics/perspectives they are exposed to on Twitter. 

The user will take a selfie and give us their Twitter handle - based on that, we'll pull their list of 'friends' (people they follow according to the Twitter API). We'll pull their text descriptions and strip out the common words to get only the keywords and hashtags they use. We'll then classify those words (manually or programmatically) and generate a Voronoi diagram of their interests. 

The selfie the user takes will be converted to greyscale and then tessellated into a low poly version. We'll overlay the generated Voronoi diagram on top to get color values; the more colors, the more complex and diverse the network of viewpoints the user is exposed to.

Closures Exercise

Link to assignment demo

I think I have the hang of closures (sort of... maybe...), but the exercises were an exercise in debugging as well. Let's see how this went.

Exercise 1

Use a closure to create 100 DOM elements with setTimeout()

The text given was wrong, and I was able to get through it after a couple of iterations on the code. I first had some mistakes with setting up the loop and setTimeout correctly - everything was computed instantly instead of being staggered - but the key, weirdly, seemed to be creating a separate count variable that could increment outside of the loop rather than passing in an argument (which is where the second exercise is giving me trouble).

Here's the code excerpt

makeElements(100);

function makeElements(input)
{
    // count lives outside makeElt, so every scheduled call closes over the
    // same variable and picks up where the last one left off
    var count = 0;

    for (var i = 0; i < input; i++)
    {
        // stagger creation: the i-th element appears after i * 100 milliseconds
        setTimeout(makeElt, i * 100);
    }

    // kick off Exercise 2 after a couple of seconds
    setTimeout(exerciseTwoSetup, 2000);

    function makeElt()
    {
        var divElement = document.createElement('div');
        console.log(divElement.innerHTML);
        var divText = document.createTextNode('Number: ' + (count + 1));
        divElement.appendChild(divText);
        console.log(divElement.innerHTML);
        document.body.appendChild(divElement);

        count++;
    }
}

Exercise 2

Use a closure to animate a DOM element in some way with the style() function. (Fill in the blanks).

I am SO CLOSE to figuring this one out! I first started out by making the Exercise 2 h3 header clickable so that it would change color, and then realized I had missed the point of the prompt, which was scale. So I created multiple divs with a class marking them as clickable, but weirdly my event handler was triggering so that every other element was already flashing colors when it was first created. Here's my code snippet: 

//Animation exercise
function exerciseTwoSetup()
{
    //create the Exercise 2 header
    document.body.appendChild(document.createElement('p'));

    var headerTitle = document.createTextNode('Exercise 2');
    var headerElement = document.createElement('h3');
    headerElement.appendChild(headerTitle);
    headerElement.id = "exercisetwo";
    document.body.appendChild(headerElement);

    //create the clickable divs to flash colors when clicked on
    for(var k = 0; k <= 20; k++)
    {
        console.log('creating element');
        var divElement = document.createElement('div');
        var divText = document.createTextNode('Click on Me!');

        divElement.appendChild(divText);
        divElement.className = "clickable";
        divElement.id = "click" + k;
        divElement.style.padding = "15px";
        document.body.appendChild(divElement);
    }
    
    //add event handler
    var elements = document.getElementsByClassName("clickable");

    for(var j = 0; j < elements.length; j++)
    {
        elements[j].addEventListener("click",animate);
    }
    
}

function animate() 
{
    // console.log("called for: " + document.getElementById(this.id));
    
    //check to see if element has been clicked on previously to turn color flashing on or off
    if(document.getElementById(this.id).className === "clicked")
    {
        clearInterval(interval); 
        document.getElementById(this.id).style.color = "black";
        document.getElementById(this.id).className = "clickable";   
    }   
    else
    {
        console.log("turning color on.");
        document.getElementById(this.id).className = "clicked";
        interval = setInterval(changeColor, 250);
    }

    function changeColor() 
    {
        //BUG! unable to change color since this keyword returns null
        this.style.color = getRandomColor();
        
    }

    function getRandomColor() //taken from http://stackoverflow.com/questions/1484506/random-color-generator-in-javascript
    {
        var letters = '0123456789ABCDEF'.split('');
        var color = '#';
        for (var i = 0; i < 6; i++ ) 
        {
            color += letters[Math.floor(Math.random() * 16)];
        }
        return color;
    }
}

Where I've got it now: the closure triggers successfully, but I'm not able to actually generate the random color. I'm using the this keyword inside changeColor, but by the time setInterval calls it, this no longer refers to the clicked element, so the color never makes it onto the div.
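One possible way around it (not what's in the demo): capture the clicked element in a variable so the interval callback closes over it, and store the interval id on the element itself so multiple divs don't fight over one shared interval variable. This sketch reuses the getRandomColor() helper from the snippet above:

function animate()
{
    var element = this;                           // 'this' is the clicked div at this point

    if (element.className === "clicked")
    {
        clearInterval(element.flashInterval);     // stop this element's own interval
        element.style.color = "black";
        element.className = "clickable";
    }
    else
    {
        element.className = "clicked";
        element.flashInterval = setInterval(function changeColor() {
            element.style.color = getRandomColor();   // closure over 'element', no reliance on 'this'
        }, 250);
    }
}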

Exercise 3

Use a closure to make an API call to openweathermap.org. Send openweathermap a zip code and when the weather is returned, create a DOM element with that zip code and the weather data.

I ran out of time to work on the second part of this because I spent time wrestling with p5. For whatever reason I cannot make any p5 calls, which is why all my exercises are done without it. It's such a shame, because the p5 version of the code is much simpler than what I implemented. I'm not sure why it fails; I forked off the example code, so the folder paths are preserved.

In any case, I resorted to jQuery in desperation and was able to parse the data structure returned by OpenWeatherMap. Note to self: when using an API, there's a 100% chance you have to register to get a key. 

Code is here: 

function assignQuery(zip)   // zip is passed in as an argument; jQuery is loaded on the page for $.getJSON
{
    var headerTitle = document.createTextNode('Exercise 3');
    var headerElement = document.createElement('h3');
    headerElement.appendChild(headerTitle);
    headerElement.id = "exercisethree";
    document.body.appendChild(headerElement);

    var url = 'http://api.openweathermap.org/data/2.5/weather?zip='+ zip + ',us&appid=2de143494c0b295cca9337e1e96b00e0';
  
    $.getJSON(url, gotData);

  function gotData(data) 
  {
    console.log(data);

    var weatherDiv = document.createElement('div');
    var weatherText = document.createTextNode(data.weather[0].description + " for zip code " + zip);

    weatherDiv.appendChild(weatherText);
    document.body.appendChild(weatherDiv);
  }

}

To use multiple zip codes, I imagine you'd probably store an array of the zip codes that the user would want to see and then call the function on each element in the array, but I guess that's maybe sidestepping the closure functionality? Is this better performance wise?
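A quick sketch of how that might look, assuming assignQuery() takes the zip as its argument (as in the snippet above). Because each call closes over its own zip, the right zip code is still attached when the JSON callback eventually fires; and since $.getJSON is asynchronous, the requests effectively run in parallel rather than one after another:

var zips = ['10003', '60601', '94103'];    // hypothetical example zip codes

zips.forEach(function (zip) {
    assignQuery(zip);                      // each call keeps its own zip via the closure
});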

In conclusion, I thought I knew what I was doing. Perhaps not so much.

If it weren't for the typo, this would be the summary of my relationship with closures



A Game of Pride and Prejudice - Using the Markov Chain Mixer

I built off the Markov Chain mixer example to mash up Pride and Prejudice with A Game of Thrones. Here's what I found in my experimentation:

  • Cross-origin restrictions are very confusing - I had to upload the modified code to Github in order to test it.
  • The runtime for the full texts took a very long time, which makes sense because the code has to generate n-grams for the entire text. So I mashed up the first chapters of each text instead.
  • Words that don't exist in either text (or in the English dictionary) were created. For example, I saw the word 'Michaelmaster' come up.
  • Even if I pushed in the direction of one text more than the other, sometimes I'd get a result that leaned more heavily toward the other text, simply because a randomly selected n-gram would be unique to that text and would influence the results going forward (see my last screenshot)

Within the generate() function in sketch.js, the text is weighted based on the slider position, but what happens if we seed the random n-gram selection function? What if we could influence the choice() function call? Or does that defeat the purpose of a Markov chain?
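As a thought experiment, a weighted choice between the two texts' candidate n-grams might look like this (the function and argument names are hypothetical, not the mixer's actual choice() implementation):

function weightedChoice(candidatesA, candidatesB, weightA) {
    // weightA between 0 and 1: 1 always picks from text A, 0 always from text B
    var pool = (Math.random() < weightA) ? candidatesA : candidatesB;

    if (pool.length === 0) {
        // if the chosen text has no matching n-grams, fall back to the other one
        pool = (pool === candidatesA) ? candidatesB : candidatesA;
    }

    return pool[Math.floor(Math.random() * pool.length)];
}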

More often than not, I got a result that leaned more heavily toward A Game of Thrones as I played with the mixer. I think that, probability-wise, I had more variety in the n-grams generated by Game of Thrones than by Pride and Prejudice; in the first chapter of PnP, for example, we get a lot of instances of Bennet, but the first chapter of GoT is from Bran's point of view, so for n-grams containing ' B' you could get either Bran or Bennet. Initially I thought the probability of selecting Bran was higher because the text reads Bennet as Mr.Bennet instead of Mr. Bennet with a space (there was also Mrs. Bennet, for example); but I think the chapter I used from Game of Thrones is longer than the one I used from Pride and Prejudice, allowing more instances of n-grams generated by the word 'Bran' to be pushed into the array. 


Midterm Documentation

Link to midterm

Link to slides

Overview

My goal was to build a visualization that shows the breakdown of the parts of speech within a user-inputted passage of text, specifically a classic novel. This midterm covers the first step - showing the total counts of the parts of speech in a novel - in a project that I'd like to continue building out over the rest of the semester. It was built for a general audience as a first look at how complex a classic novel is when broken down into its words. The question I ask with this project is: does the increased use of certain parts of speech correlate to the readability of classic literature?

Background

Schools teach certain novels based on their thematic relevance, but what interests me is the complexity of these novels from a grammar standpoint. The first noted study of classic literature readability was in 1880 by Lucius Adelno Sherman, who noticed that the length of sentences in novels had changed from 50 words per sentence in the Pre-Elizabethan era to 23 words per sentence in his time (DuBay, 2007). Although the studies in the article looked at vocabulary, I am interested to see if there are correlations between the usage of parts of speech and readability.

A common measure of readability is the Flesch-Kincaid test, which generates a score estimating the readability of a text from its word, sentence, and syllable counts. Syllables are an interesting factor that doesn't appear in any of the other tests from the DuBay reading, but I couldn't determine whether there is a way to correlate syllables to parts of speech. 
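For reference, the standard Flesch Reading Ease formula behind that score only needs three totals; counting syllables accurately is the hard part, so this sketch just takes them as inputs:

function fleschReadingEase(totalWords, totalSentences, totalSyllables) {
    // higher scores mean easier reading; the Austen novels below land around 60
    return 206.835
        - 1.015 * (totalWords / totalSentences)
        - 84.6 * (totalSyllables / totalWords);
}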

Creating the Visualization: Goal and Interaction Model

The goals of the visualization were:

  • Count the number of times the parts of speech occurred
  • Compare the times a part of speech occurred to others
  • Show links between parts of speech to indicate which parts of speech were strongly connected based on the probability of finding a particular part of speech after one has been read in the text

The last goal was a stretch goal and is actually being covered later in the semester by Dan Shiffman's Programming A to Z class, which looks at text analysis methods. I'm combining this midterm with the requirements of that class in order to produce a visualization in which the user provides the data. Here is the interaction model:

  • User enters the title of the novel they want to analyze
  • User enters the text of the novel to analyze
    • They copy-paste the text into a text field
    • They upload a text file (.txt)
  • The visualization is generated
  • The user hovers over each bubble to get more information about the data. 

Creating the Visualization: Data

I came across Project Gutenberg as I was searching for literary APIs - the site provides free ebooks of many classic novels with very few limitations for its use as the copyrights for many of the books have expired. As each novel is well over a thousand words, they make a very good data set. I can also compare multiple works by the same author to get data about that author.

Creating the Visualization: Development

I built the visualization using the following technologies: 

  1. RiTa - a JavaScript library that deals with analyzing text grammatically
  2. D3 - I built the bubble chart with this library

I had three initial pieces to work with in developing this project; they were the main building blocks for this first iteration of the project.

At a high level, here's how the code works (a sketch of the counting step follows this outline):

- Get the user input
    - Check that the user has put in a valid file type
    - Check that the user has entered a title and file
- Generate the data
    - Split the text file into an array of strings (remove punctuation and tokenize)
    - Loop through each word to find the part of speech
    - Store the count of each part of speech to a dictionary and push the key to an array
    - Find the general type of part of speech (adj, adv, noun, verb) for each word
- Create the visualization
    - Pass in the arrays of data
    - Generate the bubbles and draw the SVG element
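A minimal sketch of the counting step, assuming RiTa 2.x in the browser (RiTa.tokenize() splits the text into tokens and RiTa.pos() returns one Penn Treebank tag per token); my actual code also maps the tags into the simple adj/adv/noun/verb groups:

function countPartsOfSpeech(text) {
    var words = RiTa.tokenize(text);          // e.g. ['It', 'is', 'a', 'truth', ...]
    var tags = RiTa.pos(words);               // e.g. ['prp', 'vbz', 'dt', 'nn', ...]
    var counts = {};                          // tag -> number of occurrences

    for (var i = 0; i < tags.length; i++) {
        counts[tags[i]] = (counts[tags[i]] || 0) + 1;
    }

    return counts;                            // e.g. { nn: 1200, in: 950, jj: 450, ... }
}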

Data Collected - A Sample Case Study

Since Pride and Prejudice is my favorite classic, I decided to use a small sample of Jane Austen's work to see what data I could generate and whether there are any patterns in her writing that could lead to understanding its complexity. First I used Shiffman's Flesch Index Calculator to generate the following table:

Title (Year Published) - Flesch Score

Pride and Prejudice (1813) - 60.27
Mansfield Park (1814) - 62.33
Emma (1815) - 58.07
Persuasion (1817) - 60.08
Northanger Abbey (1817) - 59.72

The scores remain closely clustered together.

Here are the visualizations I generated (click to expand)

She uses quite a lot of prepositions or subordinating conjunctions (after, although, provided that) and singular/mass nouns, which may generate a lot of syllables that would affect the Flesch-Kincaid score. However, each novel as a data set is massive in terms of word count, so that may be the primary driving factor in the complexity. 

Creating the Visualization: Next Steps

I built this visualization mostly for myself (and also partially because my previous concepts were unrealizable dreams), but I think it would be interesting to people who would like to analyze text complexity or how novels are built (kind of an engineering take on literature).

I would like to increase the complexity of the graph by using a pack hierarchy to group together related parts of speech. I collected the simple part of speech breakdown groups but wasn't fully able to map the different parts of speech correctly to each of these groups with the dictionaries that I built.
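A sketch of what that grouping might look like with D3's pack layout (assuming D3 v4+ and hypothetical simplified counts; the real tag-to-group mapping is the part I haven't finished):

var data = {
    name: 'parts of speech',
    children: [
        { name: 'nouns', children: [{ name: 'nn', value: 1200 }, { name: 'nns', value: 430 }] },
        { name: 'verbs', children: [{ name: 'vb', value: 510 }, { name: 'vbd', value: 640 }] }
    ]
};

var root = d3.hierarchy(data)
    .sum(function (d) { return d.value; });   // leaf counts roll up into the group circles

var packed = d3.pack()
    .size([600, 600])
    .padding(4)(root);

// packed.descendants() now carries x, y, and r for every group and every tag,
// ready to bind to <circle> elements.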

The next step I wanted to accomplish was to build a force-directed graph showing which parts of speech most commonly appear together, with a link being created if one part of speech appears before/after another. This seems related to Markov chains, which I discussed with Dan Shiffman briefly and which will be covered later in the semester. I hope to build on this project when I cover that material.

Sources

DuBay, William H. Unlocking Language: The Classic Readability Studies. Costa Mesa, CA: Impact Information, 2007. Print.

Hoka, Brenda Lynn. "Comparison of Recreational Reading Books Levels Using the Fry Readability Graph and the Flesch-Kincaid Grade Level." (1999): n. pag. Web. 14 Oct. 2015.

Shiffman, Daniel. "A to Z F15 Week 2 Notes." Programming A to Z Fall 2015. N.p., n.d. Web. 15 Oct. 2015.