Transparent Text Symposium Liveblog – Day 2
1:43:10pm: A comment from the panel – Americans listen to podcasts because they don’t like to read. This might miss part of the picture – people read faster than speech, but you can’t multitask while reading, which might be the real reason some prefer to listen.
12:33:45pm: Example 5: Analysis of the phrases used by newspapers. “Unborn child” is more likely to be used by conservative papers. The animation of word use over time (here in the defense context) is pretty damn cool.
12:20:54pm: Kevin Quinn, Prof of Law at Berkeley, talking about visualizing political speech. Main thought: statistical models can be used to improve visual displays of data. They become increasingly useful for visualization as data size and complexity increase. This is especially true for textual data.
A statistical model is an assumption about how the text was generated – ie about the family of probability distributions that generated the observed data (typically indexed by a set of parameters). Background data may also inform your prior beliefs about the parameter values.
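To make that concrete, here is a toy example of a “family of distributions indexed by parameters” (my illustration, not Quinn’s actual model): a multinomial unigram model, where each word is drawn independently with probability theta_w, and the observed counts give you an estimate of theta.

    # Toy sketch (my illustration, not Quinn's model): a multinomial unigram
    # model. The "family" is all multinomials over the vocabulary; the
    # parameters theta_w pick out one member of that family.
    from collections import Counter

    def mle_theta(tokens):
        """Maximum-likelihood estimate of word probabilities from observed text."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    speech = "the senator yields the floor to the senator from ohio".split()
    theta = mle_theta(speech)
    print(theta["the"])  # 0.3 -- "the" is 3 of the 10 observed tokens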
Example: senate speech data from ’97 to ’04. Over 118k speeches. Speeches were stemmed, generating 4k unique stems. 74m stem observations. Looking at frequency of unigrams (single words). Agglomerative clustering generates 42 topics into which speeches were classified (eg, symbolic speech, international affairs, procedure, constitutional, etc). [but does it account for one speech having multiple topics?]
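I don’t know Quinn’s exact pipeline, but a minimal sketch of the general recipe (unigram counts per speech, then agglomerative clustering into a fixed number of topics) might look like this, assuming scikit-learn:

    # Hedged sketch of the general approach (not Quinn's actual code):
    # count unigram stems per speech, then cluster agglomeratively.
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.feature_extraction.text import CountVectorizer

    speeches = [
        "yield floor senat bill amend",   # toy stemmed speeches; the real
        "defens appropri militari fund",  # dataset has 118k+ of them and
        "yield time senat amend vote",    # would use n_clusters=42
    ]
    X = CountVectorizer().fit_transform(speeches)
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())
    print(labels)  # one label per speech -- hence my aside about speeches
                   # that span multiple topics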
Example 2: political positions of newspapers. Looked at editorial columns of newspapers to see their stance on SCOTUS opinions, then placed the stances of justices on a spectrum together with the newspapers. NYT is more liberal than all of the justices, but Thomas and Scalia are further right than any newspaper.
Example 3: Invocation of commonsense. More conservative papers are likely to invoke common sense when describing SCOTUS opinions.
Example 4: Show which newspapers include a selected quote from a case, with x-axis being the political position of the newspaper.
11:56:21am: Emily Calhoun from MAPLight.org. Nonprofit, funded by the Sunlight Foundation (among others). Berkeley-based. Small staff. They do a mashup of financial and voting data (sourced from Center for Responsive Politics, National Institute on Money in State Politics, LA City Ethics Commission, GovTrack.us, and CA Leg Counsel). They have a nice, clean graph embedding feature so you can show a graph you dynamically generated on your website.
They have an interest in analysing the substance of bills.
11:31:11am: To implement the crowdsourcing project – fairly low costs – 1 week of developer time, a couple of days for a designer. 450k+ documents available for processing. Entertaining case study.
11:26:43am: Now showing crowdsourcing the processing of MP receipts, claim/expense forms. I used to do this stuff when I was working part time as an undergrad – processing application forms, indexing and data entry. This is the same thing, except that anyone who feels like doing it can log on to the website and do it. There are data quality issues, but there are ways around that.
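One standard way around the quality issue (my assumption about the general technique, not necessarily what the Guardian did) is redundancy: send each receipt to several volunteers and only accept a field when a majority agrees.

    # Sketch of majority-vote quality control for crowdsourced data entry
    # (an assumed technique, not necessarily the Guardian's implementation).
    from collections import Counter

    def consensus(entries, min_agreement=2):
        """entries: values typed by different volunteers for one field."""
        value, votes = Counter(entries).most_common(1)[0]
        return value if votes >= min_agreement else None  # None => flag for review

    print(consensus(["£120.00", "£120.00", "£210.00"]))  # £120.00
    print(consensus(["£120.00", "£210.00"]))             # None -- re-queue it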
11:21:11am: Come on, call a spade a spade… it’s a graph of MP spending. Sure it’s a “data visualization” of it, but speak like the person on the street and call it a graph. Oh, it doesn’t sound sexy? Get over it. Using “graph” will also help you fit into Twitter’s character limit.
11:20:06am: Simon Rogers, The Guardian (UK). MPs’ expenses – how to crowdsource 400k documents.
Showing an infographic of government spending broken down by department and function. The Guardian created the Data Store, where they upload spreadsheets to Google Docs – a collecting point for info. The Guardian manually copies data from paper sheets and turns it into graphs.
Case study about analysis of MP allowances claims scandal.
9:22:48am: Building Watson: Overview of the DeepQA Project. Training a computer to play Jeopardy. There’s a big challenge to parse natural language to understand what a question is asking.
8:58:21am: Color popularity on the web – differs by country-code (ccTLD) domain.
8:56:25am: 95k+ documents / 2gb of federal legislation, cut into 10-word snippets, generates 28gb of index = lots of repetition. Lots of overlapping language… no further analysis done.
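The ~10x blowup makes sense if the snips are a sliding window: each word lands in up to ten of them. A quick sketch of that “shingling” (my reading of “10 word snips”), which also makes the repeated boilerplate trivial to spot:

    # Sketch of 10-word sliding-window snippets: each word appears in up to
    # 10 snippets, hence an index ~10x the source text, and duplicated
    # boilerplate surfaces immediately.
    from collections import Counter

    def shingles(text, n=10):
        words = text.split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    corpus = [
        "be it enacted by the senate and house of representatives of the united states",
        "be it enacted by the senate and house of representatives in congress assembled",
    ]
    counts = Counter(s for doc in corpus for s in shingles(doc))
    print(counts.most_common(1))  # the shared boilerplate shingle, count 2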
8:47:55am: Semantic Super Computing: Finding the Big Picture in Text Documents (Daniel Gruhl, IBM Almaden Research Center in CA). Basic processing flow: petabytes’ worth of source docs; create ~10x as much metadata as there is source data; then index everything. Creating metadata dynamically: as more metadata is generated, it may alter metadata generated earlier, as more context is understood.
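My reading of the dynamic-metadata point, as a toy sketch (my invention, not the Almaden pipeline): a later pass revises earlier annotations once more context has accumulated.

    # Toy sketch of context-dependent re-annotation (my invention, not the
    # Almaden pipeline): a second pass revises an ambiguous tag once enough
    # context has accumulated elsewhere in the corpus.
    docs = ["Almaden hosts the lab", "the Almaden research center studies text"]

    # Pass 1: naive tagging with no corpus-level context yet.
    tags = {doc: {"Almaden": "UNKNOWN"} for doc in docs}

    # Pass 2: corpus-wide context ("research center") disambiguates all mentions.
    if any("research center" in doc for doc in docs):
        for doc in docs:
            tags[doc]["Almaden"] = "ORGANIZATION"

    print(tags)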
8:33:17am: The evaluation of the clustering method is amazing – it looks to create an appreciable difference in clustering quality.
8:20:57am: Automated methods for conceptualising text. Computer-assisted conceptualization is arguably more effective. Using a classification-based approach, specifically cluster analysis (simultaneously creating categories and assigning documents to categories). Humans can’t conduct cluster analysis manually very well.
Bell number = the number of ways of partitioning n objects. Bell(2) = 2, Bell(3) = 5, Bell(5) = 52, Bell(100) ≈ 10^28 × the number of elementary particles in the universe. So the goal of creating an optimal application-independent cluster analysis method is mathematically impossible.
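For reference, the Bell numbers are easy to compute via the Bell triangle (Bell(100) has 116 digits, which is why exhaustive search over clusterings is hopeless):

    # Bell numbers via the Bell triangle: each row starts with the last
    # entry of the previous row; each next entry adds the entry above.
    def bell(n):
        row = [1]
        for _ in range(n - 1):
            nxt = [row[-1]]
            for value in row:
                nxt.append(nxt[-1] + value)
            row = nxt
        return row[-1]

    print(bell(2), bell(3), bell(5))  # 2 5 52
    print(len(str(bell(100))))       # 116 digits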
We could create a long list, but too hard to process and pick the best clustering unless we organize the list first. To organize, we develop a conceptual “geography” of clusterings.
First, code text as numbers. Second, apply all clustering methods we can find to the data within 15 mins. (This produces too many results for a person to understand.) Third, develop an application-independent distance metric between clusterings, a metric space of clusterings and a 2D projection. Fourth, local cluster ensemble, then create animated visualization, which then generates a smaller list which is comprehensible. We can then pick what’s best for us.
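A hedged sketch of the geometry in steps 3–4, with stand-in choices (variation of information as the distance between clusterings, MDS for the 2D projection – I didn’t catch the exact metric King uses):

    # Sketch of a "geography of clusterings" with stand-in choices (variation
    # of information as the distance, MDS for the 2D map); not necessarily
    # the metric King's method actually uses.
    import numpy as np
    from sklearn.manifold import MDS

    def variation_of_information(a, b):
        """Distance between two clusterings, given as label arrays."""
        a, b = np.asarray(a), np.asarray(b)
        vi = 0.0
        for i in np.unique(a):
            for j in np.unique(b):
                p_i, p_j = np.mean(a == i), np.mean(b == j)
                p_ij = np.mean((a == i) & (b == j))
                if p_ij > 0:
                    vi -= p_ij * (np.log(p_ij / p_i) + np.log(p_ij / p_j))
        return vi

    clusterings = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 1]]  # toy label sets
    D = np.array([[variation_of_information(a, b) for b in clusterings]
                  for a in clusterings])
    xy = MDS(n_components=2, dissimilarity="precomputed").fit_transform(D)
    print(xy)  # each clustering becomes a point on a 2D map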
8:08:39am: And we’re back. Next up is “Quantitative Discovery of Qualitative Information : A General Purpose Document Clustering Methodology”. Gary King, from Harvard, is presenting.
7:43:23am: Break.
7:40:08am: Design case studies by thincdesign.
7:36:14am: Brain Movies. Shows brain activity when viewing movies.
7:29:31am: Unicode. There is no such thing as plain text – “avoiding mojibake”. The advice is to use Unicode (UTF-8, to be specific) encoding for text files. Telling us about character sets. Be explicit. [Eg, in HTML: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />]
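The same “be explicit” advice applies at the file-handling level – eg, in Python:

    # "Be explicit": declare UTF-8 rather than trusting the platform default,
    # which is where mojibake usually creeps in.
    with open("notes.txt", "w", encoding="utf-8") as f:
        f.write("café, naïve, 日本語\n")

    with open("notes.txt", encoding="utf-8") as f:  # same encoding on the way back
        print(f.read())

    # Decoding those UTF-8 bytes as latin-1 is how you manufacture mojibake:
    print(open("notes.txt", encoding="latin-1").read())  # "cafÃ©, naÃ¯ve, ..."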
7:23:37am: Transparent Control Systems. The purpose of text transparency is to “provide daylight on governance systems. But governance systems are really control systems. If we want to effect change, it will be in the control systems, not the data.” [Law as Code Lessig reference?] Can break control systems down to make them easier to understand. Example using an engine piston.
7:17:50am: VUE (Visual Understanding Environment). Pulls out tags generated by OpenCalais into a Viewmap. Seasr integrated into VUE to add things (notes and metadata) to nodes. Can load XML, CSV, RSS files. Rapid visualization of structured data?
7:12:39am: Dido. David Karger at the MIT Haystack group. Text is easier to write than structured data; the goal is to incentivize people to write structured data. Contains in-page visualizations.
7:10:36am: Linked Open Data. Manipulating linked sets of data. “What was each senator’s vote on the economic stimulus along with the population and median income of the state he or she represents?” No easy way of doing this (not even Wolfram Alpha yet :). Showing Guanxi, which is used to discover connections. Showing Audrey, which is about discovering and sharing news stories (like Google Reader but with a stronger discovery/recommendation service via a social networking component).
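Spelled out, the quoted question is a join across datasets that don’t live together. A toy illustration (my example, with approximate 2009 figures – not any of the demoed tools):

    # Toy illustration of the join the quoted question requires (my example,
    # with approximate 2009 figures; not any of the demoed tools).
    import pandas as pd

    votes = pd.DataFrame({"senator": ["Kerry", "Snowe"],
                          "state": ["MA", "ME"],
                          "stimulus_vote": ["yea", "yea"]})
    states = pd.DataFrame({"state": ["MA", "ME"],
                           "population": [6_593_000, 1_318_000],
                           "median_income": [65_401, 46_581]})
    print(votes.merge(states, on="state"))  # the "no easy way" part is getting
                                            # these tables linked in the first place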
7:02:38am: Data Portraits. They think that presenting data in a picture is more interesting (may or may not be unintentionally trivializing their message here). 1. Showing Fernanda’s Themail project (email analysis program). Xobni has a similar feature (but Xobni’s not as pretty). 2. Lexigraphs, showing most used words in a person’s twitter stream + most recent words used – words are placed in outline of person as art. 3. Mycrocosm. People keep track of things themselves using graphs. Reminds me of a basic version of Feltron’s Annual Report. 4. Personas.
6:57:09am: We Meddle. Identifying the linear nature of Twitter as a limitation. Showing We Meddle interface – filtering system – bigger text for people “closer” to you. But can reverse if interested in seeing what people more distant from you are saying (since you probably speak to people closer to you in person more often).
6:52:26am: Twittermood. A display of US mood swings. Another Twitter analyzer. Showing NYT’s interactive infographic of word prominence during the Superbowl. Trying to track US “happiness”. [Kind of presumptuous to treat Twitter users as representative of the US! But anyway.] Just flashed up an amusing graphic showing that people tend to be happy in Central Park. Uses Twitter’s gardenhose stream (~1.8m messages per day).
The conference Twitter stream is still outrageous – why do I need 10 people telling me the same thing every 300 seconds?
6:47:55am: Bengali Intellectuals in the Age of Decolonization. Showing the architecture of the project – SMIL files and gazetteer info get fed into a database, then output to pages. Slice and dice video clips. Concept tagging using OpenCalais, plus commenting and geographic and chronological searching. Ties in to a lecture capture system to generate transcripts. Ok, seems like a metadata system built on top of video media. Tufts project.
6:43:44am: Picturing to Learn. Students drawing indicates their understanding/misunderstanding of scientific concepts (in the examples shown, chemistry concepts). Now tracking and tagging scans of these drawings. Looking for help.
6:38:48am: NY Times R&D. Emerging trends: the web of documents is being transformed. Pages are being disaggregated into their component pieces of data and repackaged. What does it mean for NYT to be a media brand in a disaggregated world? Freebase. NYT likes to collaborate around visualization with other (design) organizations. Cloud computing: shifting from one multi-purpose computer to many multi-purpose computing devices. This means that data, content and text have to “get smarter” – they need to know how they’re being displayed, who’s watching, and the context in which they’re being displayed. Focus on device-independent media (for print and screen).
6:33:39am: Transparency and Exploratory Search. Something about search and metadata.
6:28:43am: nFluent. They’re an IBM research group dealing with translation. [I saw them demo this yesterday – it’s like Google Translate, but they are building up their translation engine in a slightly different way to Google – they insist it’s uniquely different, but I’m not so gung ho about that.] Their translation engine translates pages in real time as they load, and people can highlight incorrectly translated words and offer correct translations. Crowdsourcing approach. Some interesting trends from analysing user contributions: 1% are “superusers” (contributing 30% of the data, by words translated), 65% are “refiners”, and 33% are “consumers” (normally ~90% in other contexts). Gong didn’t go off, excellent!
6:23:22am: Day 2, Ignite-style presentations (5 minute presentations about what people/companies are doing). They’re using a gong to cut short presentations. My side comments in square brackets.
Web Ecology Project. Analysis of data on social networks (Twitter, Friendfeed, etc).
Transparent News Articles via Semantic Models. Parses news articles through a browser plug-in. The user can highlight words to find out more about a topic. Eg, highlighting “Oracle’s $11.1 bn acquisition of PeopleSoft” brings up a context menu where you can select “What happened to this acquisition?”, which then brings up, inline, a timeline of the relevant M&A activity. Clicking on the timeline loads a relevant article. Uses an engine they developed called Brussels for parsing.
MassDOT Developers. Speaker from Mass Exec Office of Transportation. MassDOT brings together several Mass agencies. Now moving to open data. If you wanted to make a transport app [eg, an iPhone app] then you would have to scrape the data and then worry about IP issues. MassDOT is opening up their data (from various agencies) to third party developers. [Cityrail take note!] Within a month, three transport apps were developed. Eg, MassTransit, To a T, RMV Wait Times. These apps wouldn’t have been developed otherwise. [I have no idea why Cityrail wants to close their data, other than it might expose how late their trains always run.] Challenge is extracting data, such as GPS data for buses, from systems which weren’t built with this in mind (ie exporting to XML or other accessible formats). Figuring out when a bus arrives is “lifechanging”, especially when the weather in winter is -10 degrees… Fahrenheit.
Day 2.