Data Viz from 2D-4D

Timeline.js Exercise - Retrospective

Of all the tools we used in class, this was the one I was most disappointed in, because the experience of using it was far more painful than expected. I'd known about Timeline.js prior to the class, since someone had sent out a link to it earlier in the semester, and I had thought about using it around midterms when I was still considering a Marvel-themed project.

I worked with Hovsep and Gabe to build a timeline of John Boehner's resignation as Speaker of the House. I wasn't able to find the interface we were using in class that day, but I did find the spreadsheets we manipulated. Having to switch email accounts from my NYU one to my Gmail was a point of frustration - it probably had more to do with security restrictions, but it didn't make a good first impression. As a UX designer, I expect that if I ran a usability test, the majority of people would give up on the tool at this point, because the barrier to entry was high and there was no obvious workaround.

Screenshot of spreadsheet

For the most part we ended up not using the interface to create the timeline; instead, we put entries into the spreadsheet manually. Even with this approach the opportunity for error was high (at least, we made a lot of errors). For example, there was some confusion about what the media field was for: we were putting source links there, thinking it would pull the images from the articles, which led to errors being generated. I think that requirement was more obvious in the interface, since you had to upload a media file, but it was another pain point for our team.

There's a saying that people won't remember what you said, but they'll remember how you made them feel. Though I understand that the interface is a work in progress, it was a struggle for me to revisit this particular exercise because of the frustration I had with the tool. The current solution is not ideal - I realize I'm probably not the use case this tool is designed for, but there are many usability problems. The intent behind the interface, showing the user a live-updating version of the timeline as they enter data, is a good one, but it still has a long way to go.

As for the tool actually fulfilling its purpose, which is to create an interactive timeline of data, I would much prefer to make one out of HTML/CSS myself in the future. The QGIS tool gave me similar frustrations, but there's no denying its capabilities, and I can't think of an easier alternative that would be a time-effective replacement. For this tool, however, there are other timeline builders such as HSTRY and MyHistro. I needed a code to sign up for those, which makes me think there are licensing fees involved, whereas Timeline.js is free and thus good for non-profits or organizations with a small tech budget. But if I were to build my own timeline, I'd stick to making a one-page app in HTML and CSS.


Boehner Walks the Timeline - Timeline.js Exercise

Link to spreadsheet


Final Project Pitch

Here is the original project pitch for the final (taken from the email we sent)

Marijke and I are working together to create a project around self-selection and social media, specifically how social media creates an echo chamber of homogenous opinion/perspective/reference, etc.  We will analyze the Twitter accounts a person follows to determine the diversity of content they consume. The resulting visualization will be a polygonal self-portrait of the user, the complexity of which will be determined by the diversity of their twitter content consumption. The following is our workflow:

 

1 - The user takes a picture of themselves and gives us their Twitter handle

2 - We find the list of people they follow and analyze the text description of each account

3 - Based on the keywords in the text, we create tags to group together the accounts

4 - We generate a low poly version of the image based on the groupings to show the user the diversity of accounts they follow. 

 

We weren't sure how to tag and group accounts so that we could build a graph representing this diversity, so we set up some time to chat with Arlene before the following class to figure out where to go.

How Smooth is Your Echo Chamber? Final Documentation

Done in collaboration with Marijke Jorritsma (documentation was written by both of us)

Overview of the Project

“How Smooth Is Your Echo Chamber?” is a data visualization project that aims to personalize the hypothesis that social media networks create “echo chambers” of one’s own belief system.  

Based on an analysis of the locations of the accounts one follows on Twitter, a low-poly version of the person's profile picture is returned, visualizing the geographic diversity of their feed. The amount of detail, or recognizability, of the low-poly profile picture is determined by how homophilic their feed is: a more homophilic feed returns a less detailed, smoother image, while a less homophilic (or more heterophilic) feed returns a more detailed, less smooth image.

The resulting image serves as a visual metaphor for how dynamic one’s resulting worldview and self-image is based on the variation of their sources.  

Background Research

Homophily, which is often described by the saying “birds of a feather flock together,” describes the tendency of individuals to seek out and associate with others who share similar attributes (Zamal, Liu, and Ruths 2012). Homophily has been a subject of network study since the beginning of the twentieth century, when demographic characteristics such as age, sex, race/ethnicity, and education (e.g., Bott 1929, Loomis 1946) and psychological characteristics such as intelligence, attitudes, and aspirations (e.g., Almanack 1922, Richardson 1940) were found to be determinants in the creation of quickly forming small social networks (McPherson, Smith-Lovin, and Cook 2001).


Homophily has taken on renewed interest as social media networks have made it possible to collect data and track individuals’ self-curated social influences with more accuracy. Additionally, a trend toward finding news through social media channels rather than going directly to sources has led to research on the effects of social media in creating “echo chambers” of one’s political and social beliefs (Miller 2014).


Creating the Visualization - Goals and Interaction Model

The goal of our final was to distort an image of the user into a low-poly version based on the diversity of locations of the people they follow on Twitter. The more diverse a person’s information sources are, the clearer the resulting image is, reflecting their understanding of the world around them.

The Twitter API refers to the people that a person follows as that person’s ‘friends’, so I will be using that term as shorthand moving forward.

The current interaction model for the project works like this:

  1. Users are asked directly by either Marijke or me whether we have permission to use their Twitter data and a picture of them.

    1. If they say yes, we will manually store their results to a text file.

    2. If they say no, we won’t do this.

  2. Users enter their Twitter handle into a Node.js program.

  3. The program outputs all of the locations that the user’s friends have reported.

At this point, the user’s work is done. The data generated is then worked on manually by Marijke and me.

  1. The location data is analyzed to account for fake locations and repeated data.

  2. The locations are then run through a second program that generates an “echo score” based on the standard deviation of the data points.

  3. The echo score determines the number of points used to generate the low-poly image.

  4. The image is generated and returned to the user.

Creating the Visualization - Data

When we first started, we were analyzing the descriptions that people were writing about themselves to see if we could find trends in the words and what we could figure out about the user based on those words. However, we ended up shifting to analyzing locations due to some of the research we found that postulated a relationship between homophilous networks and location.

The data being used comes from a person’s Twitter account via the Twitter API. I made my API calls using Node.js - I’d previously made a Twitter bot using Node and Express, so I had some familiarity with how to do this. Unfortunately, in choosing to use Node I wasn’t able to create a webpage interface. In order to get this data, I take the user’s Twitter handle and do the following (a rough sketch of these calls follows the list below):

  1. The first API call gets the list of friends. This is returned as an array of ID numbers, one for each friend.

  2. In order to access the friends’ data, I have to pass in their IDs to get their full Twitter profiles. I do this in chunks of 100 IDs per API call, which is the maximum Twitter allows.

  3. I store each friend in an array as an object containing their ID, location, and description.
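
Here is a minimal sketch of what those calls look like in Node, assuming the twit npm package and API keys kept in a hypothetical config.js; cursor pagination and rate-limit handling are omitted, so treat it as an outline of the approach rather than the exact program we ran.

```javascript
// Sketch of the friend-collection step using the `twit` npm package.
// config.js (hypothetical) holds the consumer/access keys.
var Twit = require('twit');
var config = require('./config');
var T = new Twit(config);

function collectFriends(handle, done) {
  // 1. Get the IDs of everyone the user follows ("friends" in Twitter's terms).
  T.get('friends/ids', { screen_name: handle, stringify_ids: true }, function (err, data) {
    if (err) { return done(err); }
    var ids = data.ids;
    var friends = [];
    var chunks = [];
    // 2. users/lookup accepts at most 100 IDs per call, so request in chunks of 100.
    for (var i = 0; i < ids.length; i += 100) {
      chunks.push(ids.slice(i, i + 100));
    }
    (function next(c) {
      if (c >= chunks.length) { return done(null, friends); }
      T.get('users/lookup', { user_id: chunks[c].join(',') }, function (err, users) {
        if (err) { return done(err); }
        // 3. Keep only the fields analyzed later: ID, location, description.
        users.forEach(function (u) {
          friends.push({ id: u.id_str, location: u.location, description: u.description });
        });
        next(c + 1);
      });
    })(0);
  });
}

// Usage: collectFriends('some_handle', function (err, friends) { /* write to a text file */ });
```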

Sample of data (with permission from Daria Bojko)

After I’ve collected all the friends’ location and description data, I use a regular expression to split them into individual words. This can sometimes be buggy - for example, if a description is in a language that doesn’t use Roman letters (Korean, Chinese, Hindi, etc.), the regular expression is unable to parse it into words.

I then count the number of times each word appears independently. This works well enough for descriptions but not for locations; if a location is two words, then analyzing each individual word becomes meaningless. For example, New York would be divided into two words, and the number of times New appears is different from the number of times York appears, so some of the context is lost. For locations, therefore, I generated pairs of words in order to retain some of that context. I stored these in a concordance to track unique entries - if a word was already in the concordance, its count increased by one; if not, it was added. A sketch of this counting step is below.
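
This is a simplified sketch of that counting step, reusing the friend objects collected above: a single-word concordance for descriptions and a word-pair concordance for locations.

```javascript
// Counting sketch: word concordance for descriptions, word-pair (bigram)
// concordance for locations so that "new york" stays together.
function addToConcordance(concordance, token) {
  if (concordance[token] === undefined) {
    concordance[token] = 1;      // first time we've seen this token
  } else {
    concordance[token] += 1;     // seen before: bump the count
  }
}

function buildConcordances(friends) {
  var words = {};   // individual words from descriptions
  var pairs = {};   // word pairs from locations
  friends.forEach(function (friend) {
    // \W+ only understands Roman letters, which is the bugginess mentioned above.
    var descTokens = friend.description.toLowerCase().split(/\W+/).filter(Boolean);
    descTokens.forEach(function (w) { addToConcordance(words, w); });

    var locTokens = friend.location.toLowerCase().split(/\W+/).filter(Boolean);
    for (var i = 0; i < locTokens.length - 1; i++) {
      addToConcordance(pairs, locTokens[i] + ' ' + locTokens[i + 1]);
    }
    if (locTokens.length === 1) {
      addToConcordance(pairs, locTokens[0]);   // single-word locations still count
    }
  });
  return { words: words, pairs: pairs };
}
```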

There are a lot of different ways to describe the same location: New York City, NYC, NY NY, and New York New York all refer to the same place. I couldn’t think of an automated way to handle this, so we had to manually sort through the locations to decipher what they meant and correctly count the number of people from each place. Even this led to some problems - New York could mean New York the state or NYC, and the code would need some tweaks to really differentiate between the two, but this was a rare case for the most part. We also had to account for fictional or joke locations (“everywhere and nowhere” was an example we found frequently), which we treated as not reported. A sketch of how part of this cleanup could eventually be automated is below.
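
For illustration, here is a hypothetical sketch of how part of that manual cleanup could be automated with a hand-maintained alias map; the entries are examples, not the full list we actually worked through by hand.

```javascript
// Hypothetical alias map for normalizing location strings. Anything mapped to
// null (joke locations) is treated the same as "not reported".
var aliases = {
  'nyc': 'new york city',
  'ny ny': 'new york city',
  'new york new york': 'new york city',
  'sf': 'san francisco',
  'everywhere and nowhere': null
};

function normalizeLocation(raw) {
  var key = raw.toLowerCase().trim();
  if (aliases.hasOwnProperty(key)) {
    return aliases[key];             // null means treat as not reported
  }
  return key;                        // pass through anything we haven't mapped
}
```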

Sample of cleaned data

Once the data was cleaned, we ran it through a program Marijke wrote (with help from Kevin Irlen) that calculated the total number of locations, mean, variance, standard deviation, and “echo score,” which we defined as (1 - standard deviation) * data set length. The echo score gave us the number of points to use in generating the low-poly image; Marijke then used a program that created the image from that number of points. A sketch of the calculation follows the screenshot below.

Running data through analysis program
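
Here is a rough sketch of that calculation. I’m assuming the standard deviation is taken over each location’s share of friends (a proportion between 0 and 1), which keeps the score positive; Marijke’s actual program may differ in the details.

```javascript
// Echo score sketch: (1 - standard deviation) * number of distinct locations,
// with the deviation computed over each location's share of located friends.
function echoScore(locationCounts) {           // e.g. { 'new york': 51, 'san francisco': 41, ... }
  var counts = Object.keys(locationCounts).map(function (k) { return locationCounts[k]; });
  var totalFriends = counts.reduce(function (a, b) { return a + b; }, 0);
  var shares = counts.map(function (c) { return c / totalFriends; });

  var mean = shares.reduce(function (a, b) { return a + b; }, 0) / shares.length;
  var variance = shares.reduce(function (sum, s) {
    return sum + (s - mean) * (s - mean);
  }, 0) / shares.length;
  var stdDev = Math.sqrt(variance);

  return (1 - stdDev) * shares.length;         // data set length = number of distinct locations
}
```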

Case Study - My Data

At the time I ran this test, I was following a little over 400 people. Many of them are in fields that I am particularly interested in: user experience, museums, comics, and games, while some are people I know personally. I grew up in the suburbs of Chicago, but when I joined Twitter I was interning in Silicon Valley.

Here are my top five results:

1: not provided - 98 people

2: new york - 51 people

3: san francisco - 41 people

4: los angeles - 25 people

5: francisco ca - 23 people

Not Provided being the top result doesn’t surprise me, since Twitter doesn’t require location reporting. The three most common locations are New York, San Francisco, and LA. ‘Francisco CA’ is a duplicate of San Francisco produced by the word-pair counting, so it doesn’t count. I was only a little surprised by LA, since I have never lived there, but then I realized that’s where a lot of television series accounts would be located.

My stats are in the screenshot above - I had 74 total locations after cleaning the data, with a very low standard deviation indicating that most of my data was clustered together. I ended up with an echo score of 71.14, which was rounded down to 71 points.

Marijke took my Facebook and Twitter profile pictures and made a low-poly version of each picture with 71 points. I thought that was pretty good until I saw what it actually translated to in modifying my images.

It’s not as diverse as I thought it would be, although I do have a relatively small pool of topics that I use Twitter to learn about. Still, my face is kind of discernible in both pictures, and when I posted the result to Facebook I got some interesting comments about it. People who were unfamiliar with my project found the image unsettling.

In terms of interactivity, at this stage we can show people where their friends are from with a prototype: hovering over a point in the image reveals one of the unique locations their friends reported. The link below, using my results, shows how we picture people seeing the location data alongside the image - to see a location, hover over a point. Unfortunately the program Marijke used didn’t seem to allow including the count alongside each location, but that data could be made available in later iterations depending on how we build it.

Link to Interactive Image

Link to Interactive Graph of where my Twitter friends are from

Next Steps

In an ideal world, everything that Marijke and I do manually would be automated - the location analysis would require some natural language processing and machine learning to find out what words map to which location and what is not actually a location.

We would also like to analyze some of the additional factors we found in the research that lead to more homophilous networks, and the best source of this data would be Facebook rather than Twitter. While both rely on self-reporting, Facebook has structured inputs for these factors, whereas Twitter only provides a limited description and an optional location field. While Twitter’s description text is an interesting resource, it’s filtered through the person writing it. Data on gender, education, etc. from Facebook, however, is less nebulous.

So the interaction model would look like this:

  1. The user logs into an application - Facebook, mobile, some form factor to be determined - with Facebook

  2. Data about the user’s friends is scraped to account for location, race/ethnicity, age (from birthday), education, occupation, and gender.

  3. The user’s profile picture would be copied and put through the program to be edited into the low poly format.

  4. The user would get their picture along with a breakdown of their data explaining what it all means and how it affects the way they learn about the world around them.

We simplified this stage of the project to analyze only location, in order to understand how to calculate the echo score on a factor that was more quantitative and would not introduce bias (although we did introduce some in the manual sorting of the data), but we hope to scale that up to include the six factors we didn’t get to in order to create a more accurate picture. We think parsing and sorting that data is a machine learning problem, and the calculations would definitely need more statistical analysis than either of us is familiar with.
 

As for where this project is going in the future, funnily enough, on the day we presented this to the class Marijke found someone who had done a project estimating ideological positions from Twitter data: he wrote an R script that analyzes political alignment based on which politicians a person follows, which is a research question very similar to the one we asked in developing this project. We are also applying to the Data Science Incubator at NYU’s Data Science Center; there are a lot of engineering aspects we need help with, and we think the incubator would be best suited to helping us with them.


The biggest takeaway was seeing how small our exposure to the world is even when we think we’re well informed. By no means will I start reading conservative news sources regularly, but seeing my own features blurred by the program almost to the point of being unrecognizable was a big surprise. I’m curious to see how other factors will affect my results in the future. Although Twitter isn’t the data source we want in the future, the words we found in the self descriptions and the locations revealed a lot about the people we follow. Somewhere in those words, we were able to find the pieces of ourselves too.

Sources

Social Media Deepens Partisan Divides. But Not Always

http://www.nytimes.com/2014/11/21/upshot/social-media-deepens-partisan-divides-but-not-always.html?_r=1

 

Birds of a Feather

http://tuvalu.santafe.edu/~aaronc/courses/5352/readings/McPherson_Smith-Lovin_Cook_01_BirdsOfAFeature_HomophilyInSocialNetworks.pdf

 

Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data

http://pan.oxfordjournals.org/content/23/1/76.full.pdf?keytype=ref&ijkey=uMFPw4dsMHM7608

 

Homophily and Latent Attribute Inference

https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/viewFile/4713/5013

 

How Diverse is Facebook?

https://m.facebook.com/notes/facebook-data-science/how-diverse-is-facebook/205925658858

 

Using Lists to Measure Homophily on Twitter

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.302.4452&rep=rep1&type=pdf

 

A Robust Gender Inference Model

http://firstmonday.org/ojs/index.php/fm/article/view/5216/4113

 

Preference, Homophily, and Social Learning

http://pages.stern.nyu.edu/~ilobel/PreferencesHomophily.pdf

 

Dan Shiffman, Word Counting

http://shiffman.github.io/A2Z-F15/week5/notes.html

 

Why Buzzfeed is Trying to Shift Its Strategy

http://mobile.nytimes.com/2014/08/13/upshot/why-buzzfeed-is-trying-to-shift-its-strategy.html?referrer=

Final Project Pitch

I have two ideas, both of which I really love, but one of which I think would be particularly interesting in terms of text analysis.

The first is to keep building on my "But No Black Widow Movie" Twitter bot. I would like to increase the complexity of generated tweets; currently the tweets follow the format of "A <male Marvel character> <movie type> announced for Phase <number>, #butnoblackwidowmovie?". What I'm thinking about adding is: 

A <Marvel character> sequel featuring <Marvel character> 

<Marvel character or group> teams up with <Marvel character or group> 

And any other variants thereof. I also wanted to utilize the WTF Engine to do some fun text generation. A rough sketch of the template-filling idea is below.
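
Here is a rough sketch of that template-filling approach, with placeholder character lists rather than the bot's actual data; each template is just a function that picks random elements for its slots.

```javascript
// Template-based tweet generation sketch. The lists and phase numbers below
// are placeholders, not the bot's real data.
var maleCharacters = ['Ant-Man', 'Doctor Strange'];      // placeholder data
var characters = ['Hulk', 'Hawkeye', 'Thor'];            // placeholder data
var movieTypes = ['movie', 'sequel', 'Netflix series'];  // placeholder data

function pick(list) {
  return list[Math.floor(Math.random() * list.length)];
}

var templates = [
  function () {
    return 'A ' + pick(maleCharacters) + ' ' + pick(movieTypes) +
           ' announced for Phase ' + (3 + Math.floor(Math.random() * 3)) +
           ', #butnoblackwidowmovie?';
  },
  function () {
    return 'A ' + pick(characters) + ' sequel featuring ' + pick(characters) +
           ', #butnoblackwidowmovie?';
  },
  function () {
    return pick(characters) + ' teams up with ' + pick(characters) +
           ', #butnoblackwidowmovie?';
  }
];

console.log(pick(templates)());   // one randomly generated tweet
```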

My other idea that I'm thinking about more strongly is in collaboration with someone in my data visualization class; we were interested in the diversity of viewpoints in the content people consume on Twitter - does social media create an echo chamber of homogenous opinion/perspective? We do this by creating a low polygon image of a person that's colored based on the variety of topics/perspectives they are exposed to on Twitter. 

The user will take a selfie and give us their Twitter handle - based on that, we'll pull their list of 'friends' (people they follow according to the Twitter API). We'll pull their text descriptions and strip out the common words to get only the keywords and hashtags they use. We'll then classify those words (manually or programmatically) and generate a Voronoi diagram of their interests. 

The selfie the user takes will be converted to greyscale and then tessellated into a low-poly version. We'll overlay the generated Voronoi diagram on top to get color values; the more colors, the more complex and diverse the network of viewpoints the user is exposed to.

R-n't I Glad This isn't a Final Project Requirement

I respect what R can do, but this is the first programming language I've run into that didn't immediately make sense. It's very different from what I've used previously, so it took some time.

The Marvel API is back in working order, so I scraped it for the entire list of characters in the Marvel comics universe. I used R on one of the extracted files as a sample and was able to read it in, but when I tried to analyze the data, I ran into a bunch of nulls.

It turns out the file did have a header row (which I discovered after some experimentation); once I read it in with the header accounted for, I was able to actually run functions on the data.

R is good at working with quantitative data, which will actually be useful once I collect all the data I need. For the final project I want to find the relationship between Marvel's writers and characters, which means I'll have to run some queries once I finish pulling data from the API. I'll need to experiment a little more with R to really understand what's going on, but I'm coming around... maybe...