O'Reilly Radar
Top stories: January 30-February 3, 2012
Here's a look at the top stories published across O'Reilly sites this week.
What is Apache Hadoop?
Apache Hadoop has been the driving force behind the growth of the big data industry. But what does it do, and why do you need all its strangely-named friends? (Related: Hadoop creator Doug Cutting on why Hadoop caught on.)
Embracing the chaos of data
Data scientists, it's time to welcome errors and uncertainty into your data projects. In this interview, Jetpac CTO Pete Warden discusses the advantages of unstructured data.
Moneyball for software engineering, part 2
A look at the "Moneyball"-style metrics and techniques managers can employ to get the most out of their software teams.
With GOV.UK, British government redefines the online government platform
A new beta .gov website in Britain is open source, mobile friendly, platform agnostic, and open for feedback.
When will Apple mainstream mobile payments?
David Sims parses the latest iPhone / near-field-communication rumors and considers the impact of Apple's (theoretical) entrance into the mobile payment space.
Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.
Publishing News: B&N closes doors on Amazon Publishing
Here are a few of the stories that caught my attention this week in the publishing space.
Barnes & Noble puts its foot down on Amazon
Last week, Amazon teamed up with Houghton Mifflin Harcourt to print and distribute the Amazon Publishing East Coast's adult titles under a new imprint, New Harvest. Some speculated the move might get Amazon through the brick-and-mortar doors of B&N. This week, B&N made it clear that not only would HMH's New Harvest imprint not make it in the door, but that no Amazon Publishing title would. In a post for the New York Times, Julie Bosman quoted from a statement made by Jaime Carey, B&N's chief merchandising officer:
"Our decision is based on Amazon's continued push for exclusivity with publishers, agents and the authors they represent. These exclusives have prohibited us from offering certain e-books to our customers. Their actions have undermined the industry as a whole and have prevented millions of customers from having access to content. It's clear to us that Amazon has proven they would not be a good publishing partner to Barnes & Noble as they continue to pull content off the market for their own self interest."
O'Reilly's general manager and publisher Joe Wikert called on B&N this week to disrupt the industry; maybe this is its first move. Bosman also took a look at B&N's position in the industry and its importance to the publishing ecosystem, especially in the face of a competitor like Amazon. Jordan Weissmann at The Atlantic mulled the prospects of Amazon killing publishing and argued: "In a financial arms race, publishers simply can't beat Amazon's arsenal."
Breaking up is hard to do
Amazon had issues with a social networking partner this week as well. As of Monday, Goodreads no longer displayed book data from the Amazon Product Advertising API, opting instead to move its data partnership to the Ingram Book Company. A Goodreads representative told Laura Hazard Owen that "the [API license agreement] terms now required by Amazon have become so restrictive that it makes better business sense to work with other data sources." Owen outlined some of the specifics on the restrictions:
"Amazon requires sites that use its API to link that content back to the Amazon site exclusively — so a book page on Goodreads would have to link only to its product page on Amazon and not to any other source or retailer ... Amazon also does not allow any content from its API to be used on mobile sites and apps."
Jon Mitchell at ReadWriteWeb took a deeper look into the situation — and explained why Goodreads will survive its breakup with Amazon.
The news caused some readers to worry about their cultivated Goodreads bookshelves. GalleyCat detailed potential data issues and offered up a Goodreads link that allows users to check on the state of their shelves to see if any tidying up is necessary.
Jonathan Franzen waxes absurd on ebooks
There's no shortage of things slated to be destroying society, and this week, author Jonathan Franzen added ebooks to the list. The Telegraph quoted Franzen speaking at a book festival in Cartagena, Colombia:
"I think, for serious readers, a sense of permanence has always been part of the experience. Everything else in your life is fluid, but here is this text that doesn't change. Will there still be readers 50 years from now who feel that way? Who have that hunger for something permanent and unalterable? I don’t have a crystal ball. But I do fear that it's going to be very hard to make the world work if there's no permanence like that."
Chenda Ngak at CBS's techt@lk took offense at Franzen's remarks, stating: "Even if I agree with him, as a book lover, his statements are too condescending to take seriously." Jonathan Segura at NPR chimed in as well, calling Franzen's comments "absurd" and pleading that we "get past the e-books versus print books thing." Segura's final comment pretty much summed up the overarching sentiment:
"We should worry less about how people get their books and — say it with me now! — just be glad that people are reading."
Photo (top): Kiftsgate Court, Chipping Campden, Gloucestershire - No Entry - sign by ell brown, on Flickr
Photo (bottom): Broken Kindle by kodomut, on Flickr
Related:
- We're in the midst of a restructuring of the publishing universe (don't panic)
- Hating Amazon is not a strategy
- Coming soon to a location near you: The Amazon Store?
- Open Question: Is it realistic for publishers to cut Amazon out of the equation?
- More Publishing Week in Review coverage
Visualization of the Week: Mapping Mexico's drug war
Diego Valle-Jones has created a powerful interactive map of the ongoing drug war in Mexico.
The interactive map lets you compare homicides and drug-related homicides, with the option to examine marijuana, opium, and drug-lab-related homicides. If you click on a bubble, you can see the number of murders over time, dating back to 2004. Important events are highlighted on that time line. You can also draw a shape on the map to look at a particular region.
Click to see the full interactive version of "Map of the Drug War in Mexico."
Valle-Jones writes:
"To unclutter the map and following the lead of the paper Trafficking Networks and the Mexican Drug War by Melissa Dell, I decided to only show the optimal highways (according to my own data and Google Directions) to reach the US border ports from the municipalities with the highest drug plant eradication between 1994 and 2003 and the highest 2d density estimate of drug labs based on newspaper reports of seizures. The map is a work in progress and is still missing the cocaine routes, but hopefully I'll be able to add them shortly."
The data can be exported to CSV, and the source code is available on GitHub.
Found a great visualization? Tell us about it
This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.
More Visualizations:
- Politicians' word counts
- Visualizing SOPA tweets
- Visualizing your friends' Facebook likes
- AntiMap lets users capture and visualize their movements
- Mapping traffic casualties
- More Visualizations of the Week
Makers and hackers: The Where Conference is looking for you
The program for Where, our geolocation and mapping conference, is almost complete. Now we're looking for makers, hackers, developers, and DIYers to bring awesomeness to the 2012 Where Conference (April 2-4 in San Francisco).
There are three ways to participate.
1. Share an amazing geo/location/data visualization video or image
Geodata is often best expressed visually. Inspired by projects like Cab Spotting, Dave Imus' The Essential Geography Of The United States Of America and Eric Fisher's Locals and Tourists, we want your data viz videos, imagery and cartography. (Be sure that you have rights to the underlying data and that you attribute it properly.)
2. Create an interactive RFID installation
Inspired by Mediamatic, each attendee will have an RFID tag that can be paired with our conference social network. If an attendee swipes his or her tag, you'll be able to:
- Fetch info about the owner of a swiped badge.
- Show the owner of a swiped badge where they are supposed to be next, according to their personal schedule.
- Send the owner of a swiped badge a message via the attendee directory.
- Make two owners of swiped badges contacts within the attendee directory.
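The four badge actions listed above could be modeled roughly as follows. This is only a sketch: the post does not describe the actual interface, so the `Attendee` class, the `DIRECTORY` dictionary, the tag ids, and every function name here are hypothetical stand-ins that use in-memory data rather than a real conference service.

```python
from dataclasses import dataclass, field


@dataclass
class Attendee:
    name: str
    schedule: list  # (start_hour, session_title) pairs
    inbox: list = field(default_factory=list)
    contacts: set = field(default_factory=set)


# Hypothetical in-memory attendee directory keyed by RFID tag id.
DIRECTORY = {
    "tag-001": Attendee("Ada", [(9, "Keynote"), (11, "Mapping 101")]),
    "tag-002": Attendee("Grace", [(10, "Sensors"), (14, "Geo APIs")]),
}


def fetch_info(tag_id):
    """Return the directory entry for the owner of a swiped badge."""
    return DIRECTORY[tag_id]


def next_session(tag_id, current_hour):
    """Return the title of the owner's next scheduled session, if any."""
    upcoming = [s for s in DIRECTORY[tag_id].schedule if s[0] >= current_hour]
    return min(upcoming)[1] if upcoming else None


def send_message(tag_id, text):
    """Leave a message in the badge owner's inbox."""
    DIRECTORY[tag_id].inbox.append(text)


def make_contacts(tag_a, tag_b):
    """Link the owners of two swiped badges as mutual contacts."""
    DIRECTORY[tag_a].contacts.add(tag_b)
    DIRECTORY[tag_b].contacts.add(tag_a)
```

An installation would call these from whatever code handles a swipe event, for example `next_session("tag-001", 10)` after a badge is read at 10 a.m.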
3. Exhibit at the Where Mini Maker Faire
The Where Mini Maker Faire will take place on Wednesday, April 4. We're interested in any hardware project that is in the geo/location/sensing space, particularly ones that feature:
- Kinect/Computer Vision
- Arduino/Lilypad/ADK
- Processing for Android
- Beagle Board/Panda Board
- NFC/RFID
- Gadgeteer
- Wearables
- ROBOTS!!!!
Mini Maker Faire setup includes a four-inch skirted, countertop-level table, Wi-Fi and power.
Acceptances will be rolling. The deadline to get your proposal in is March 1, so apply soon. If your project is accepted for any of the above, you'll receive a pass to Where.
Where Conference 2012: O'Reilly's Where Conference, being held April 2-4 in San Francisco, is where the people working on and using location technologies explore emerging trends in software development, tools, business strategies and marketing. Save 20% on registration with the code RADAR20.
Four short links: 3 February 2012
- Page Speed (Google Code) -- an open-source project started at Google to help developers optimize their web pages by applying web performance best practices. Page Speed started as an open-source browser extension, and is now deployed in third-party products such as Webpagetest.org, Show Slow and Google Webmaster Tools.
- What Commons Do We Wish For? (John Battelle) -- trying to understand what the Internet would look like if we don’t pay attention to our core shared values. Excellent piece from jbat, who is thinking and writing in preparation for another book.
- The Trouble with Popularity -- this blog post on StackOverflow does a great job of explaining why moderators are necessary, and why it's not in everyone's interest to give them what they want. Sad to see this come out just as Yahoo! continues to gut and fillet Flickr, which used to be the benchmark for all things community.
- The Ongoing Fight Against GPL Enforcement -- interesting! Software Freedom Conservancy, who have pursued several cases against manufacturers who ship GPLed code but do not release their source and modifications to it, have used busybox as a fulcrum for their GPL code release lever. Manufacturers may be attempting to replace busybox with non-GPLed code to take away the fulcrum. In other news, engineering metaphors are like a massless body at light speed before the Big Bang: unknowable.
Strata Newsletter: February 2, 2012
"Bring me my sword."
—Dr. Hans Rosling
Top-of-the-List Thinking from Edd and Alistair
A great link from Alasdair Allan about real-time interaction tracking for iPad apps set me thinking about the future of software usability. How long will it be before somebody figures out how to enable the front camera to do eye-tracking as well?
Pretty soon, our devices will be able to measure not just our interactions with them, but also our emotional responses. You can bet there's a robot somewhere just about ready to do this.
But where does this take us? I'm somewhat wary of a world filled with sycophantic devices, cocky in their assurance that they can know me through measurement. In one episode of Buffy the Vampire Slayer—stay with me—a guy builds his perfect robot girlfriend, only to find he can't love somebody programmed to be perfect for him.
I think the next step is personality. When we theme the iPads of our future, we'll be selecting temperament as much as UI. (I know already I'll be heading for the "wry, sarcastic" theme.)
Perhaps Iain M. Banks has it right in his Culture books. The drones and Minds are a very plausible rendition of the direction user interface is taking us.
When you read your iPad, will your iPad read you?
Cheers,
Edd Dumbill & Alistair Croll
Chairs, Strata
Use code NEWS20 and save 20%.
Tracks include: Data Science, Business & Industry, Visualization & Interface, Hadoop & Big Data, Policy & Privacy, and Domain Data.
Not Smart or Rich—or alas, even Pretty
You know that big data has come firmly into the mainstream when Fast Company allows an "expert blogger," Daniel Rasmus, to write deeply on the subject while name-checking Isaac Asimov and exploring it as an existential phenomenon. He writes, "As Big Data becomes the next great savior of business and humanity, we need to remain skeptical of its promises as well as its applications and aspirations."
Complexity Itself
Author Allen Downey can't really wait for the March release of his new title, Think Complexity, so he's releasing it bit by bit online. The book focuses on data and structure, Python programming, computational modeling, and nothing less than the very philosophy of science itself. In fact, he writes, "I think complexity is a 'new kind of science' not because it applies the tools of science to a new subject, but because it uses different tools, allows different kinds of work, and ultimately changes what we mean by 'science.'" As Downey himself admits, he's Probably Overthinking It.
Chaos Theories
"The heart of data science is designing instruments to turn signals from the real world into actionable information," says Jetpac CTO Pete Warden. "Fighting the data providers to give you those signals in a convenient form is a losing battle, so the key to success is getting comfortable with messy requirements and chaotic inputs. As an engineer, this can feel like a deal with the devil, as you have to accept error and uncertainty in your results. But the alternative is no results at all." In this conversation with Audrey Watters, Warden outlines different ways to embrace the chaos of raw data. Bonus? A great infographic.
Letter from Davos
Based in Switzerland last month, New York Times blogger Nick Bilton reports on what he terms the Davos Data Deluge, in a terrific piece from the Bits blog. EU countries are seemingly more concerned with the use of personal data than is the US, and the World Economic Forum released its Big Data, Big Impact report warning of dire consequences if data is misappropriated and compromised. Bilton writes: "As the World Economic Forum report says: 'Concerted action is needed by governments, development organizations and companies to ensure that this data helps the individuals and communities who create it.'"
Data Docking
On July 20, 1969, the Apollo 11 spacecraft landed on the moon, making history and creating heroes of the astronauts aboard. On January 9, 2012, that story was retold, animated solely by data.
Stats R Sexy
Edd recently helped to judge a unique contest, one that challenged entrants to create applications for R in business. Strangely, the top two were both concerned with air travel; the rest of them travel far and wide. It's always fun to take a peek at how others approach cash-prize problems.
Hadoop Huzzah
Last week, we announced that Hadoop World would become part of the Strata NYC conference slated for this autumn. That doesn't stop us from exploring all things cloud-based in Santa Clara this month. In fact, we have tracks on the Future of Hadoop, Hadoop and Big Data tracks, and Hadoop and Big Data applied sessions. Come join us up in the aerie!
Visionaries
Producing nerd-tastic glasses that mimic the pocket-protector beauty of 1950s geekdom with their thick black frames and blocky forms is Warby Parker's specialty. The retro look, however, is the only old-fashioned thing about this forward-looking company. They donate one pair of glasses to a needy person for every pair purchased by a fortunate person, and customers have the luxury of trying on five pairs at home at a time. They're doing everything they can to be transparent to their customers and stakeholders, and to that end have created an infographic that allows anyone to peek at any aspect of their 2011 fiscal year, something we find both brave and beautiful.
Conf Confidential
Not everyone who wants to attend Strata will be able to, so we have a nice slate of free live webcasts planned for February that anyone can watch and learn from.
On Feb. 10 at 10 a.m. PST, digital business analyst Kord Davis presents on The Ethics of Big Data: Value Personas in Practice. Davis explains: "Value Personas offer an organization a means of capturing a common set of values which can be used to encourage organizational alignment, acceptable business practices and individual behaviors based on a common set of values. They help to identify shared values and create a vocabulary for explicit dialog, thereby reducing risk from misalignment and encouraging collaboration and innovation across working teams."
On Feb. 17 at 10 a.m. PST, 10gen director of product marketing Jared Rosoff gives an hour-long tutorial titled MongoDB Schema Design: How to Think Non-Relational. The normal rules don't apply to MongoDB, Rosoff says, adding: "The simple fact that documents can represent rich, schema-free data structures means that we have a lot of viable alternatives to the standard, normalized, relational model. Not only that, MongoDB has several unique features, such as atomic updates and indexed array keys, that greatly influence the kinds of schemas that make sense."
And on Feb. 21 at 10 a.m. PST, scientist Maksim Tsvetovat leads a discussion on Analyzing Social Media Response to Elections. Tsvetovat promises, "In this webcast, I will present a simplified methodology that one could use as a jumping-off point to automated discourse analysis at scale—and apply to real-time data streams coming from electoral politics of the 2012 Republican primary." Sounds like one we won't want to miss.
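To make Rosoff's contrast between the normalized relational model and rich documents a little more concrete, here is a minimal sketch. It assumes nothing about MongoDB's real API: plain Python dictionaries stand in for rows and documents, and the `posts`, `post_tags`, `to_document`, and `build_tag_index` names are invented for illustration. The small index over array elements only gestures at the idea behind indexed array keys; it is not how MongoDB implements them.

```python
from collections import defaultdict

# Relational-style layout: one row per fact, joined by post_id.
posts = [{"post_id": 1, "title": "Schema design"}]
post_tags = [
    {"post_id": 1, "tag": "mongodb"},
    {"post_id": 1, "tag": "nosql"},
]


def to_document(post, tag_rows):
    """Fold the joined rows into one self-contained document."""
    tags = [row["tag"] for row in tag_rows if row["post_id"] == post["post_id"]]
    return {"_id": post["post_id"], "title": post["title"], "tags": tags}


def build_tag_index(documents):
    """Map each array element to the ids of the documents that contain it."""
    index = defaultdict(list)
    for doc in documents:
        for tag in doc["tags"]:
            index[tag].append(doc["_id"])
    return dict(index)


documents = [to_document(p, post_tags) for p in posts]
tag_index = build_tag_index(documents)
print(documents[0])
print(tag_index)
```

The document form keeps the tags with the post they describe, so a reader needs no join to reassemble it; the trade-off is that the same data is no longer spread across independently updatable rows.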
Archived and Quantified
On Jan. 25, we hosted the Strata Santa Clara Preview, Towards the Quantified Society, as a free online conference. If you missed it, the sessions by John Wilbanks, Bitsy Bentley, Edd and Alistair, Paul Dix, and Noah Iliinsky are patiently archived for your rare free moment.
Wot Is the Wat
Just Trust Us: Gary Bernhardt's CodeMash 2012 lightning talk is hilarious.
Dr. Rosling's Fact-Based World
Dr. Hans Rosling uses LEGO blocks and IKEA boxes to explain the world. He employs the specter of a washing machine to mark global dips below and above poverty. A TED favorite, he also happens to enjoy swallowing swords. With his Gapminder foundation, Rosling aims to "replace devastating myths with a fact-based worldview." Gapminder's mission statement continues: "Our method is to make data easy to understand. We are dedicated to innovate and spread new methods to make global development understandable, free of charge, without advertising. We want to let teachers, journalists and everyone else continue to freely use our tools, videos and presentations." Gapminder is always glad for donations, but you can enjoy the collective wisdom there free of charge.
Looking for more? Visit oreilly.com/data.
Developer Week in Review: Brother, can you spare $100 billion?
In the old days, when modems came in wooden boxes and dinosaurs ruled the earth, kids would go door to door selling cookies for Girl Scouts or magazine subscriptions to raise money for a school trip. These days, partially because of safety issues with kids out on the streets by themselves, it's usually the parents who end up bringing boxes of chocolate bars and cookie order sheets to work.
Long story short, my male-spawn's 4-H group is planning a service project to Dominica and is trying to offset the significant costs involved in getting down there with some tax-deductible donations. I thought I'd pin a notice up on the virtual bulletin board, so if you're curious, check out their video and other info. Consider yourself solicited ...
Taking stock of Facebook
In our continuing quest to be the last news outlet on the planet to report on breaking news, you might have heard that Facebook is now poised to launch a massive IPO, perhaps the largest in high-tech history. Expectations are that the company will settle in with a market cap of around 1x10^11 dollars once the stock launches midyear.
Information released in the IPO documents reveals that Facebook now has a mind-numbing 845 million users. To put that in perspective, if Facebook were a country, it would be the third largest in the world (behind China and India). The fourth largest (that would be the United States) would clock in at a measly 312 million. It's worth stepping back for a moment and considering the implications of that.
The Internet didn't really become publicly and commercially available until the early 1990s. If we're generous, the Internet has been around for 22 years. At the end of that short span, we find ourselves living in a world where a good portion of a billion people have voluntarily signed up with a single social networking site (albeit one of the first social networking sites). When I was working at MIT in the early '80s, there was a lot of discussion on the nascent mailing lists like HUMAN-NETS about how the ARPAnet might morph into a ubiquitous WORLDNET (nice name, shame it didn't stick), and what that would mean for society. Well, we're there, and the jury is still out on what a post-Internet society is going to look like. Odds are, it's going to involve a lot of Farmville, though.
I bet they picked it because it sounds Hawaiian
Fans of the Lua scripting language got a big vote of support from Wikimedia this week, as it was chosen as the new template scripting language. Lua is best known as a scripting language inside of video games, though it also has the distinction of being the only non-native, non-JavaScript language allowed to execute on iOS devices.
The selection of Lua for such a high-profile application runs against the prevailing JavaScript current, which has been strengthened significantly in recent months by HTML5. Lua is fast and small, something that can't always be said for JavaScript. With it now set to live in the heart of wikis around the world, perhaps Lua's star is finally rising.
Nature abhors a non-copyrighted vacuum
Copyrights are the appropriate way to protect source code, much more appropriate than patents on the things that the source code implements, at least in my opinion. But here's a philosophical question for you: Can you copyright an empty file?
AT&T certainly thought so, as it placed a copyright header in the /bin/true shell command file shipped with Unix, a file that (other than the copyright) was completely empty. Let's take a moment to consider this. If I use the "touch" command to create an empty file, I would be technically in violation of the copyright since it is textually identical, except for the copyright notice.
It's likely a good thing that AT&T never tried to claim a copyright violation on all the empty files around the world; it probably would have caused a divide-by-zero runtime exception at the USPTO and dumped core into the Potomac.
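Why an empty file can serve as a "true" command at all: a shell that runs a script containing no commands exits with status 0, which is all that "true" needs to do. The sketch below checks this from Python, assuming a POSIX system with `sh` on the path; the `run_empty_script` helper is a name made up for this example.

```python
import os
import subprocess
import tempfile


def run_empty_script():
    """Create a zero-byte file, run it with sh, and return (size, exit_status)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)  # like `touch`: the file exists but holds no bytes
    try:
        size = os.path.getsize(path)
        status = subprocess.run(["sh", path]).returncode
    finally:
        os.remove(path)
    return size, status


size, status = run_empty_script()
print(size, status)
```

On such a system the file size and the exit status should both be zero, so the only non-empty part of AT&T's file was the notice itself.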
Got news?
Please send tips and leads here.
Strata Week: The Megaupload seizure and user data
Here are a few of the data stories that caught my attention this week.
Megaupload's seizure and questions about controlling user data
When the file-storage and sharing site Megaupload had its domain name seized, assets frozen and website shut down in mid-January, the U.S. Justice Department contended that the owners were operating a site dedicated to copyright infringement. But that posed a huge problem for those who were using Megaupload for the legitimate and legal storage of their files. As the EFF noted, these users weren't given any notice of the seizure, nor were they given an opportunity to retrieve their data.
Moreover, it seemed this week that those users would have all their data deleted, as Megaupload would no longer be able to pay its server fees.
While it appears that users have won a two-week reprieve before any deletion actually occurs, the incident does raise a number of questions about users' data rights and control in the cloud. Specifically: What happens to user data when a file hosting / cloud provider goes under? And how much time and notice should users have to reclaim their data?
This is what you see when you visit Megaupload.com.
The financial news and information company Bloomberg opened its market data distribution interface this week. The BLPAPI is available under a free-use license at open.bloomberg.com. According to the press release, some 100,000 people already use the BLPAPI, but with this week's announcement, the interface will be more broadly available.
The company introduced its Bloomberg Open Symbology back in 2009, a move to provide an alternative to some of the proprietary systems for identifying securities (particularly those services offered by Bloomberg's competitor Thomson Reuters). This week's opening of the BLPAPI is a similar gesture, one that the company says is part of its "Open Market Data Initiative, an ongoing effort to embrace and promote open solutions for the financial services industry."
The BLPAPI works with a range of programming languages, including Java, C, C++, .NET, COM and Perl. But while the interface itself is free to use, the content is not.
Pentaho moves Kettle to the Apache 2.0 license
Pentaho's extract-transform-load technology Pentaho Kettle is being moved to the Apache License, Version 2.0. Kettle was previously available under the GNU Lesser General Public License (LGPL).
By moving to the Apache license, Pentaho says it will be more in line with the licensing of Hadoop, HBase, and a number of NoSQL projects.
Kettle downloads and documentation are available at the Pentaho Big Data Community Home.
Oscar screeners and movie piracy data
Andy Baio took a look at some of the data surrounding piracy and the Oscar screening process. There has long been concern that the review copies of movies distributed to members of the Academy of Motion Picture Arts and Sciences were making their way online. Baio observed that while a record number of films have been nominated for Oscars this year (37), just eight of the "screeners" have been leaked online, "a record low that continues the downward trend from last year."
However, while the number of screeners available online has diminished, almost all of the nominated films (34) had already been leaked online. "If the goal of blocking leaks is to keep the films off the Internet, then the MPAA [Motion Picture Association of America] still has a long way to go," Baio wrote.
Baio has a number of additional observations about these leaks (and he also made the full data dump available for others to examine). But as the MPAA and others are making arguments (and helping pen related legislation) to crack down on Internet piracy, a good look at piracy trends seems particularly important.
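To put the figures cited above in proportion, the two counts can be turned into shares of the 37 nominated films. This is a quick arithmetic sketch using only the numbers quoted in this post, not Baio's full data set; the constant and function names are made up for the example.

```python
NOMINATED_FILMS = 37
SCREENERS_LEAKED = 8
FILMS_LEAKED_ANY_SOURCE = 34


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


screener_share = share(SCREENERS_LEAKED, NOMINATED_FILMS)
any_source_share = share(FILMS_LEAKED_ANY_SOURCE, NOMINATED_FILMS)
print(f"Screeners leaked: {screener_share}%")
print(f"Leaked from any source: {any_source_share}%")
```

Roughly one nominated film in five had a screener leak, while about nine in ten were available online in some form, which is the gap Baio's quote points to.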
Got data news?
Feel free to email me.
Commerce Weekly: The return of iPhone NFC rumors
Here are some things that caught my eye in the news this week.
When will Apple mainstream mobile payments?
Now that everyone's iPhone 4S has a few dings on it and we've all grown bored flirting with Siri, our curiosity naturally turns to iPhone 5 and what gifts it will bequeath to mankind. Rumors of NFC (near-field communication, which lets phones pay with wireless technology) are at the forefront again, just as they were before the 4S arrived. As far back as August 2010, when Apple hired NFC expert Benjamin Vigier as its product manager for mobile commerce, expectations have been high that the next iPhone would include wireless payment. That was two versions ago; we must be getting close.
Seth Weintraub wrote this week on 9to5mac that a developer he met at MacWorld was building NFC into the next version of his app because Apple's iOS engineers are "heavy into NFC." Over on Fast Company, Austin Carr looked for clues in his conversation with Ed McLaughlin, who leads emerging payments at MasterCard. When Carr pressed McLaughlin for details on which handset makers were developing phones that work with MasterCard's contactless payment system, he didn't mention Apple by name but said he "didn't know of any handset maker out there who wasn't working to make their phones PayPass ready."
Why do we read these tea leaves? There are a few other NFC phones out there already, pushing the far end of the envelope. But Apple is much more significant, as Carr points out, thanks to its:
"... magical ability to transform whole industries. No one paid for music digitally before Apple unveiled iTunes; virtually no one listened to MP3 players, or carried smartphones, or played with tablets before Apple entered the markets."
Even more so than with previous trends, an enormous captive audience awaits the moment when Apple will introduce it to mobile payments. Scot Wingo notes, in a very good summary of the state of mobile commerce on Seeking Alpha, that Apple has "something like 250 million credit cards on file" in the iTunes store. Although only a fraction of those will buy the iPhone 5 in its first months out, they are sure to be customers who are already comfortable buying things through Apple's interface.
I think the biggest and best surprise will be not the date when iPhones ship with NFC, but how Apple presents a mobile wallet interface. When you think of how iTunes presented a better way to buy digital music, and when you compare the customer experience in Apple's retail stores with what you find almost anywhere else, you have to acknowledge Apple's genius in what we might call the transaction interface. Its programming efforts up front seem as likely to mainstream mobile commerce as any programming that it does behind the scenes to make those transactions occur.
What PayPal is learning at the point of sale
PayPal's point-of-sale (POS) trial with 51 Home Depot stores is rolling out to Office Depot stores, too — cautiously, according to this Reuters story, which quotes an Office Depot executive saying "there are still some rough spots in that experience." The executive didn't say whether those rough spots had to do with the technology, the way customers are using it, or just the basic unfamiliarity with it. Regardless, the novelty presents something of an opportunity for PayPal, says Anuj Nayar, PayPal's chief spokesperson. "Retailers are not technologists by nature," Nayar told me in a conversation last week. "They have to work and sell in this multi-channel environment, where increasingly the differentiator is based on technology." But keeping up with the evolving technology shouldn't be the retailer's job, Nayar says. PayPal, of course, wants to provide a commercial ecosystem — as Nayar calls it, "a one-stop tech partner for retail."
PayPal had those capabilities on display at the National Retail Federation show last month, showing the various ways it is enabling payment at the point of sale. PayPal aspires to go beyond the concept of a mobile wallet in a phone; it wants to offer a "wallet in the cloud" that lets consumers make purchases with just their mobile number and a PIN — no card or phone needed. No doubt, the trials at Home Depot will shed light on just how comfortable consumers are with this idea. So far, Nayar says, it's too early in the trial to share any of those learnings.
Nayar did share a finding from PayPal's conversations with consumers and retailers about how they want to use mobile commerce: You need to get beyond not only the friction that keeps people from using technology, but also guard against any social stigma that could arise. "For example, when I go to get coffee in the morning, if I get there and see there is a 20-minute wait, I can't wait for that. That retailer has lost a customer because of a friction point. So how do you reduce that friction? Maybe it's giving people the ability to order the coffee over their mobile before they get there? ... But we tested that, and you know what we found? People don't like to jump the line. They didn't like the idea of coming in and looking to everyone in line like they were getting to skip the line. So, maybe you need a separate line and register, a PayPal Express line or something."
In other words, we want convenience, but not at the expense of looking like we're getting special treatment. No doubt, PayPal will learn more in the coming trials, which are ramping up quickly: The company wants to be at 2,000 points of sale by the end of March.
Square hits the hustings
Square picked up a fresh round of publicity this week when word broke that staffers from both the Obama and Romney campaigns were using its plug-in dongle card reader to collect political donations for their candidates.
Obama campaign spokesperson Katie Hogan told Nick Bilton of The New York Times that the dongles were being shipped out to campaign workers across the country. The Obama campaign also hopes to create a donation app that works in conjunction with Square dongles so that any supporter can collect contributions with or without the support of the local campaign organization. All donations would obviously go to the campaign — minus the 2.75% transaction fee that Square keeps from every transaction.
The Romney campaign's digital director Zac Moffatt said the Republicans would also begin using Square as soon as this week, but he cautioned they want to make sure that using Square doesn't break any rules. "The challenge on this sort of thing is never with the technology, it's with the compliance. We're making sure everything we're doing follows fund-raising rules and is compliant with the FEC."
Although DC is generally slow to embrace new technologies, I have a hunch that tech that makes it easier for candidates to collect money will find a swift and warm welcome.
Got news?
News tips and suggestions are always welcome, so please send them along.
If you're interested in learning more about the commerce space, check out DevZone on x.com, a collaboration between O'Reilly and X.commerce.
Related:
- Near field communication: What it is and how it works
- Google juices its Wallet
- Should Square lose the square?
- PayPal expands Home Depot trial
- More Commerce Weekly coverage
What is Apache Hadoop?
Apache Hadoop has been
the driving force behind the growth of the big data industry. You'll
hear it mentioned often, along with associated technologies such as
Hive and Pig. But what does it do, and why do you need all its
strangely-named friends, such as Oozie, Zookeeper and Flume?
Hadoop brings the ability to cheaply process large amounts of
data, regardless of its structure. By large, we mean from 10-100
gigabytes and above. How is this different from what went before?
Existing enterprise data warehouses and relational databases excel
at processing structured data and can store massive amounts of
data, though at a cost: This requirement for structure restricts the kinds of
data that can be processed, and it imposes an inertia that makes
data warehouses unsuited for agile exploration of massive
heterogeneous data. The amount of effort required to warehouse data
often means that valuable data sources in organizations are never
mined. This is where Hadoop can make a big difference.
This article examines the components of the Hadoop ecosystem and
explains the functions of each.
The core of Hadoop: MapReduce
Created at
Google in response to the problem of creating web search
indexes, the MapReduce framework is the powerhouse behind most of
today's big data processing. In addition to Hadoop, you'll find
MapReduce inside MPP and NoSQL databases, such as Vertica or MongoDB.
The important innovation of MapReduce is the ability to take a query
over a dataset, divide it, and run it in parallel over multiple
nodes. Distributing the computation solves the issue of data too large to fit
onto a single machine. Combine this technique with commodity Linux
servers and you have a cost-effective alternative to massive
computing arrays.
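To make the divide-and-run-in-parallel idea concrete, here is a minimal single-process sketch of the MapReduce pattern in Python, counting words. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the way the input is split into chunks are assumptions made for illustration, and on a real cluster each chunk's map step would run on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # Each chunk stands in for the slice of data held by one node.
    # Here the map calls run one after another; on a cluster they
    # would run in parallel, which is the point of the technique.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    chunks = [
        ["big data is big"],
        ["data is everywhere"],
    ]
    print(word_count(chunks))
```

Because the map step looks at one chunk at a time, no single machine ever needs to hold the whole dataset. Only the grouped intermediate pairs are brought together for the reduce step.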
At its core, Hadoop is an open source MapReduce
implementation. Funded by Yahoo, it emerged in 2006 and,
according to its creator Doug Cutting
(http://research.yahoo.com/files/cutting.pdf), reached "web scale" capability in early
2008.
As the Hadoop project matured, it acquired further components to enhance
its usability and functionality. The name "Hadoop" has
come to represent this entire ecosystem. There are parallels
with the emergence of Linux: The name refers strictly to the Linux
kernel, but it has gained acceptance as referring to a complete
operating system.
Hadoop's lower levels: HDFS and MapReduce
Above, we discussed the ability of MapReduce to distribute
computation over multiple servers. For that computation to take
place, each server must have access to the data. This is the role of
HDFS, the Hadoop Distributed File System.
HDFS and MapReduce are robust. Servers in a Hadoop cluster can
fail and not abort the computation process. HDFS ensures data is
replicated with redundancy across the cluster. On completion of a
calculation, a node will write its results back into HDFS.
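The effect of replication on fault tolerance can be illustrated with a toy model in Python. This is a sketch only: the round-robin placement, the function names and the node and block labels are all assumptions for illustration, and HDFS's real placement policy is more sophisticated than this. A replication factor of 3 is assumed here.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive offsets modulo len(nodes) are always distinct nodes.
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


if __name__ == "__main__":
    placement = place_blocks(["b0", "b1", "b2", "b3"], ["n0", "n1", "n2", "n3"])
    # With three replicas per block, losing two nodes leaves every block readable.
    print(sorted(readable_blocks(placement, ["n0", "n1"])))
```

In this model a block becomes unreadable only when every node holding one of its replicas has failed, which is why a computation can continue when individual servers drop out.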
There are no restrictions on the data that HDFS stores. Data may
be unstructured and schemaless. By contrast, relational databases
require that data be structured and schemas be defined before storing
the data. With HDFS, making sense of the data is the responsibility
of the developer's code.
Programming Hadoop at the MapReduce level is a case of working with the
Java APIs, and manually loading data files into HDFS.
Improving programmability: Pig and Hive
Working directly with Java APIs can be tedious and error prone.
It also restricts usage of Hadoop to Java programmers. Hadoop offers
two solutions for making Hadoop programming easier.
- Pig is a programming
language that simplifies the common tasks of working with Hadoop:
loading data, expressing transformations on the data, and storing
the final results. Pig's built-in operations can make sense of
semi-structured data, such as log files, and the language is
extensible using Java to add support for custom data types and
transformations.
- Hive enables Hadoop
to operate as a data warehouse. It superimposes structure on data in HDFS
and then permits queries over the data using a familiar SQL-like
syntax. As with Pig, Hive's core capabilities are
extensible.
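The idea of superimposing structure on data that was stored without any can be sketched in plain Python. This is not Pig or Hive code: the log lines, the `SCHEMA` list and the helper names are hypothetical, and the sketch only shows the general approach of applying a schema when data is read, then running a grouped count over the result.

```python
from collections import Counter

# Hypothetical raw log lines, stored as plain text with no schema attached.
RAW_LINES = [
    "2012-02-01 GET /index.html 200",
    "2012-02-01 GET /missing 404",
    "2012-02-02 POST /login 200",
]

# The schema is applied only when the data is read: (field name, converter).
SCHEMA = [("date", str), ("method", str), ("path", str), ("status", int)]


def read_with_schema(lines, schema):
    """Turn each raw line into a dict by applying the schema at read time."""
    for line in lines:
        fields = line.split()
        yield {name: convert(value) for (name, convert), value in zip(schema, fields)}


def count_by(rows, column, where=lambda row: True):
    """Roughly: SELECT column, COUNT(*) FROM rows WHERE ... GROUP BY column."""
    return dict(Counter(row[column] for row in rows if where(row)))


if __name__ == "__main__":
    rows = read_with_schema(RAW_LINES, SCHEMA)
    # Successful requests per day.
    print(count_by(rows, "date", where=lambda row: row["status"] == 200))
```

The stored lines never change. A different schema could be applied to the same files later, which is the flexibility that storing data without a fixed structure is meant to preserve.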
Choosing between Hive and Pig can be confusing. Hive
is more suitable for data warehousing tasks, with predominantly
static structure and the need for frequent analysis. Hive's closeness
to SQL makes it an ideal point of integration between Hadoop and
other business intelligence tools.
Pig gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming
data flows for incorporation into larger applications. Pig is a
thinner layer over Hadoop than Hive, and its main advantage is to
drastically cut the amount of code needed compared to direct
use of Hadoop's Java APIs. As such, Pig's intended audience remains
primarily the software developer.
Improving data access: HBase, Sqoop and Flume
At its heart, Hadoop is a batch-oriented system. Data are loaded
into HDFS, processed, and then retrieved. This is somewhat of a
computing throwback, and often, interactive and random access to data
is required.
Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google's
BigTable (http://research.google.com/archive/bigtable.html),
the project's goal is to host billions of rows of data for rapid access.
MapReduce
can use HBase as both a source and a destination for its
computations, and Hive and Pig can be used in combination with
HBase.
In order to grant random access to the data, HBase does impose a
few restrictions: Hive performance with HBase is 4-5 times slower than with plain
HDFS, and the maximum amount of data you can store in HBase is approximately
a petabyte, versus HDFS' limit of over 30PB.
HBase is ill-suited to ad-hoc analytics and more appropriate for
integrating big data as part of a larger application. Use cases
include logging, counting and storing time-series data.
The Hadoop Bestiary
- Ambari: Deployment, configuration and monitoring
- Flume: Collection and import of log and event data
- HBase: Column-oriented database scaling to billions of rows
- HCatalog: Schema and data type sharing over Pig, Hive and MapReduce
- HDFS: Distributed redundant file system for Hadoop
- Hive: Data warehouse with SQL-like access
- Mahout: Library of machine learning and data mining algorithms
- MapReduce: Parallel computation on server clusters
- Pig: High-level programming language for Hadoop computations
- Oozie: Orchestration and workflow management
- Sqoop: Imports data from relational databases
- Whirr: Cloud-agnostic deployment of clusters
- Zookeeper: Configuration management and coordination
Getting data in and out
Improved interoperability with the rest of the data world is
provided by
Sqoop (https://github.com/cloudera/sqoop/wiki) and
Flume (https://cwiki.apache.org/FLUME/). Sqoop is a tool designed to import data from
relational databases into Hadoop, either directly into HDFS or into
Hive. Flume is designed to import streaming flows of log data
directly into HDFS.
Hive's SQL friendliness means that it can be used as a point of
integration with the vast universe of database tools capable of making
connections via JDBC or ODBC database drivers.
Coordination and workflow: Zookeeper and Oozie
With a growing family of services running as part of a Hadoop
cluster, there's a need for coordination and naming services. As
computing nodes can come and go, members of the cluster need
to synchronize with each other, know where to access services, and
know how they should be configured. This is the purpose of
Zookeeper (http://zookeeper.apache.org/).
Production systems utilizing Hadoop can often contain complex
pipelines of transformations, each with dependencies on the
others. For example, the arrival of a new batch of data will trigger
an import, which must then trigger recalculations in dependent
datasets. The Oozie
component provides features to manage the workflow and dependencies,
removing the need for developers to code custom solutions.
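The kind of dependency management described here can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is not how Oozie itself is configured or run: the step names in `PIPELINE` are hypothetical, and the sketch only shows how an order of execution can be derived once each step declares what it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "import_batch": set(),
    "clean_data": {"import_batch"},
    "recalculate_totals": {"clean_data"},
    "refresh_report": {"recalculate_totals", "clean_data"},
}


def run_order(pipeline):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for step in run_order(PIPELINE):
        print("running", step)
```

If the declared dependencies ever form a loop, `TopologicalSorter` raises `graphlib.CycleError` instead of returning an order, which is the sort of check a hand-written scheduler tends to forget.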
Management and deployment: Ambari and Whirr
One of the commonly added features incorporated into Hadoop by
distributors such as IBM and Microsoft is monitoring and
administration. Though in an early stage,
Ambari (http://incubator.apache.org/ambari/) aims
to add these features to the core Hadoop project. Ambari is intended to help system
administrators deploy and configure Hadoop, upgrade clusters, and
monitor services. Through an API, it may be integrated with other
system management tools.
Though not strictly part of Hadoop,
Whirr (http://whirr.apache.org/) is a highly complementary
component. It offers a way of running services, including Hadoop, on
cloud platforms. Whirr is cloud neutral and
currently supports the Amazon EC2 and Rackspace services.
Machine learning: Mahout
Every organization's data are diverse and particular
to their needs. However, there is much less diversity in the kinds of
analyses performed on that data. The
Mahout project (http://mahout.apache.org/) is a library of
Hadoop implementations of common analytical computations. Use cases
include user collaborative filtering, user recommendations,
clustering and classification.
Using Hadoop
Normally, you will use Hadoop in
the form of a distribution
(http://radar.oreilly.com/2012/01/big-data-ecosystem.html). Much as with Linux before it,
vendors integrate and test the components of the Apache Hadoop
ecosystem and add in tools and administrative features of their
own.
Though not per se a distribution, a managed cloud installation
of Hadoop's MapReduce is also available through Amazon's Elastic
MapReduce service.
Related:
- Big data market survey: Hadoop solutions
- Why Hadoop caught on
- Microsoft's plan for Hadoop and big data
- Hadoop: What it is, how it works, and what it can do
- Get started with Hadoop: From evaluation to your first production cluster
Four short links: 2 February 2012
- Beautiful Buttons for Bootstrap -- cute little button creator, with sliders for hue, saturation, and "puffiness".
- CMU iPad Course -- iTunes U has the video lectures for a CMU intro to iPad programming.
- Inspiring Matter -- the conference aims to bring together designers, scientists, artists and humanities people working with materials research and innovation to talk about how they work cross- or trans-disciplinarily, the challenges and tools they've found for working collaboratively, and the ways they find inspiration in their work with materials. London, April 2-3.
- Facebook's S-1 Filing (SEC) -- the Internets are now full of insights into Facebook's business, for example Lance Wiggs's observation that Facebook's daily user growth is slowing. While 6-10% growth per quarter feels like a lot when annualized, it is getting close to being a normal company. Facebook is running out of target market, and especially target market with pockets deep enough to be monetised. But I think that's the last piece of Facebook IPO analysis that I'll link to. Tech Giant IPOs are like Royal Weddings: the people act nice but you know it's a seething roiling pit of hate, greed, money, and desperation that goes on a bit too long so by the end you just want to put an angry chili-covered porcupine in everyone's anus and set them all on fire. But perhaps I'm jaded.
Tools of Change for Publishing Newsletter: February 1, 2012
With TOC New York just over a week away, we sat down with O'Reilly online managing editor Mac Slocum to chat about our goals and hopes for the event. Here's a small snippet of our conversation.
Mac asked: What are you most looking forward to?
Kat: Everybody at O'Reilly who is even tangentially involved with TOC knows that I'm very excited about LeVar Burton. I think that he appeals to people who are either into Star Trek the Next Generation or Reading Rainbow. He's bringing the love to the geeks on the reading side and on the science side.
Joe: Our theme, the Change/Forward/Fast model. It's all about experimentation. Doing things in a quick manner, identifying what's working and what's not and reinvesting in what's working. That's a mantra at O'Reilly and we want to project it onto TOC as well. It's also an extension of an area you're going to see a lot of at TOC, and that's the agile development space. . . . It's all about getting the minimum viable product, and getting it out there quickly, and getting feedback from the audience rather than investing a whole bunch of money only to find out that the original product idea was off the mark.
Beautiful ebooks, data application, exploding publishing's borders, and other topics ensue as the discussion continues.
Cheers,
Kat Meyer and Joe Wikert
Chairs, Tools of Change
TOC New York 2012 Offers Unparalleled Networking
Over 60% of past attendees tell us they come to TOC New York to meet other people and network with their peers. And who doesn't love a good cocktail party? We've lined up some unique (and fun) events to get you meeting and greeting with one another, including a digital petting zoo, the Startup Showcase, and a not-to-be-missed party at the New York Public Library.
Hope to see you there.
Kat & Joe's Must–Reads
If You Got 'Em, Charge 'Em
A new Pew Internet survey shows that tablet ownership doubled–that is, rose by nearly 100%–during the winter holiday season. "Some 29%" of adult Americans are now thought to own at least one tablet. Meanwhile, across the pond, Amazon Kindle is said to be the UK's most unused Christmas present.
A survey conducted by MyVoucherCodes.co.uk claims that 48% of Brits received presents for Christmas 2011 that they have still yet to use. Of those, the Kindle comes out on top, with 53% of those queried admitting that they still hadn't downloaded any books to date. Mark Pearson, CEO of MyVoucherCodes, wonders: "It is surprising to see how many people have not used gifts they received almost one month ago; but I think we are all guilty of putting gifts to one side now and again. . . . but I must admit it's difficult to find a reason for people having not charged iPads!"
Penny Smart
The 99-cent special rose, dropped, and is again gaining traction with ebook pricing, according to a recent Wall Street Journal piece by Jeffrey Trachtenberg. He writes, "A growing number of publishers are experimenting with 99–cent temporary prices on e–books, in hopes of persuading readers to sample a wider range of authors."
Samsung Tab: Fail
"There is a reason," writes Dan Rowinski, "that Apple's iPad and Amazon's Kindle Fire are eating Samsung's tablet lunch." He offers a hint: "In Samsung's mind, the real competitor is Apple. Actual sales say that Amazon should be considered the primary threat."
Brick 'n' Mortar Without Borders
Writing recently on PaidContent, Laura Hazard Owen reports that Houghton Mifflin Harcourt will "publish the print versions of all of the adult titles from Amazon Publishing's New York-based division . . . and will distribute them everywhere in North America outside of Amazon.com. . . . Best of all from Amazon's point of view," Owen says, "Barnes & Noble will not get a penny from the e-book sales of Amazon Publishing titles."
Just the Faktz
Not everyone jerks a knee at the new Apple iBook EULA. Mike Elgan has a new essay out on why he thinks that the emotional response to Apple iBooks ignores the critical response. "What's strange about these emotional responses to Apple's legalese is that they fail the reality test," Elgan argues. "Apple's iBooks author terms are neither greedy nor evil; they don't mean Apple 'owns you'; and it's certainly not the worst thing Apple has ever done." See Joe's thinking about this topic below.
Striking the Balance
The tension surrounding the decision of whether an ebook is best used as an app, when to animate a book, and the costs of ruthless interactivity all occupy Theodore Gray. Gray previews his TOC keynote in this interview with Jenn Webb.
Best Weapons
Salon's Sandip Roy reports on the specter that Salman Rushdie cast on the recent Jaipur Literary Festival in India. And that's without even showing up. Gorgeously written ("Great crocodiles of school children in winter blazers. . . ."), Roy's essay considers how, even "in his absence, [Rushdie] hovered over the festival like Banquo's ghost. It was hard to find a session that didn't mention the man. Even the posters lining the entrance seemed reminders of the guest who did not come to dinner. One quoted Lyndon B. Johnson: 'A book is the best weapon against intolerance and ignorance.'"
Joe sometimes surprises us by unfolding his lanky frame from whichever airplane seat it happens to be in (the man travels a lot) to offer a contrarian opinion, one that is cogent but surprising. And so it is here in reaction to Apple's iBook EULA.
Appreciating Apple's Intent
Why all the fuss? Apple's intent has never been to improve the book publishing industry. Just like Amazon and any other ebook vendor, Apple's goal is to capture share of this rapidly growing segment. In Apple's case, they've simply decided to offer an authoring tool that's capable of creating some pretty darned cool products. If Amazon were to do the same thing and create a terrific authoring tool for .mobi or KF8 format would the industry be as upset? I don't think so.
How is this any different from the App Store model itself? Developers are creating apps for the App Store and they know they'll only run on an iOS device. They also realize they'll have to go through Apple's approval process before getting into the App Store.
Prior to the release of iBooks Author, the content creation and distribution model looked like this:
(1) Author writes material in favorite word processor.
(2) Author/publisher edits and converts that content into .mobi format for distribution on Amazon, EPUB format for distribution through iBookstore and others, etc.
The exact same model still exists today, even with the introduction of iBooks Author. That's right. Apple's EULA doesn't really lock you into their distribution channel for your content. . . . All they're really trying to do is prevent you from tweaking the output of their tool to create content for other distribution channels. OK, that's kind of annoying, but far from the lock–in nightmare so many people are describing it as.
Read on.
KF8 and iBooks Author: Up and Running
We are fortunate to have two late arrivals to the TOC speaking schedule, both of them from O'Reilly. Publishing tech specialist Adam Witwer and Sanders Kleinfeld, director of content and publishing services, are teaming up to present on a brand-hot issue: KF8 and iBooks Author.
Promising a one-stop crash course on these two topics, Sanders and Adam aim to address:
- What is KF8, how does it differ from Amazon's .Mobi format, and what new features does it add for Kindle Fire?
- How can publishers add new features for Kindle Fire while maintaining backward compatibility with Kindle eInk devices?
- What features does the new iBooks Author platform offer to publishers, and what distinguishes it from existing ebook publishing tools like InDesign, oXygen, and Sigil?
- How does the IBA format differ from EPUB? Should publishers start producing books in both formats?
- What are the implications of the iBooks Author licensing terms on authors and publishers?
- What are the pros and cons of developing ebooks in KF8 and IBA formats?
- How can publishers integrate these new formats into existing e-production workflows?
and the ever-popular MORE.
Audible Knowledge
The Latest from our TOC Podcast Series
Child's Play!
These are fine times for children, for parents, for writers, for illustrators, for animators, for game creators, for musicians, for actors, for translators and for app developers—to name a few. Indeed, the chain of folks now connected to a children's book destined for the iPad is far longer than the lowly two to three folks needed to create an old-fashioned kids' book on plain old paper. In this videocast, Joe speaks with WingedChariot's Neal Hoskins on the state of the children's digital book market.
Tucked far away on the Letters Of Note website—a delightful compendium of actual written letters—famed ad man David Ogilvy tries to explicate his method to a curious inquiry. Calling himself a "terrible copywriter," he details a long list of the research and avoidance techniques he employs to trick himself into writing some of the best ad copy ever penned. We're partial to this item:
"If all else fails, I drink half a bottle of rum and play a Handel oratorio on the gramophone. This generally produces an uncontrollable gush of copy."
Looking for more? Visit oreilly.com/toc.
Every Book Is a Startup
Ebook: $7.99
The Information Diet
Print: $22.99
Ebook: $19.99
Kindle Fire: The Missing Manual
NOOK Tablet: Out of the Box
Ebook: $2.99
Kindle Fire: Out of the Box
Ebook: $2.99
The Global eBook Market: Current Conditions & Future Projections
Breaking the Page: Preview Edition
Upcoming Events
Register now for these free, live O'Reilly webcasts.
Digital Bookmaking Tools Roundup #3
Feb. 23, 1pm PT
We hope to see you at these events.
TOC New York
Feb. 13-15, 2012
New York City
TOC Bologna
March 18, 2012
Bologna, Italy
Related:
Why Hadoop caught on
Doug Cutting (@cutting) is a founder of the Apache Hadoop project and an architect at Hadoop provider Cloudera. When Cutting expresses surprise at Hadoop's growth — as he does below — that carries a lot of weight.
In the following interview, Cutting explains why he's surprised at Hadoop's ascendance, and he looks at the factors that helped Hadoop catch on. He'll expand on some of these points during his Hadoop session at the upcoming Strata Conference.
Why do you think Hadoop has caught on?
Doug Cutting: Hadoop is a technology whose time had come. As computer use has spread, institutions are generating vastly more data. While commodity hardware offers affordable raw storage and compute horsepower, before Hadoop, there was no commodity software to harness it. Without tools, useful data was simply discarded.
Open source is a methodology for commoditizing software. Google published its technological solutions, and the Hadoop community at Apache brought these to the rest of the world. Commodity hardware combined with the latent demand for data analysis formed the fuel that Hadoop ignited.
Are you surprised at its growth?
Doug Cutting: Yes. I didn't expect Hadoop to become such a central component of data processing. I recognized that Google's techniques would be useful to other search engines and that open source was the best way to spread these techniques. But I did not realize how many other folks had big data problems nor how many of these Hadoop applied to.
What role do you see Hadoop playing in the near-term future of data science and big data?
Doug Cutting: Hadoop is a central technology of big data and data science. HDFS is where folks store most of their data, and MapReduce is how they execute most of their analysis. There are some storage alternatives (for example, Cassandra and CouchDB) and useful computing alternatives (like S4, Giraph, etc.), but I don't see any of these replacing HDFS or MapReduce soon as the primary tools for big data.
Long term, we'll see. The ecosystem at Apache is a loosely-coupled set of separate projects. New components are regularly added to augment or replace incumbents. Such an ecosystem can survive the obsolescence of even its most central components.
In your Strata session description, you note that "Apache Hadoop forms the kernel of an operating system for big data." What else is in that operating system? How is that OS being put to use?
Doug Cutting: Operating systems permit folks to share resources, managing permissions and allocations. The two primary resources are storage and computation. Hadoop provides scalable storage through HDFS and scalable computation through MapReduce. It supports authorization, authentication, permissions, quotas and other operating system features. So, narrowly speaking, Hadoop alone is an operating system.
But no one uses Hadoop alone. Rather, folks also use HBase, Hive, Pig, Flume, Sqoop and many other ecosystem components. So, just as folks refer to more than the Linux kernel when they say "Linux," folks often refer to the entire Hadoop ecosystem when they say "Hadoop." Apache BigTop combines many of these ecosystem projects together into a distribution, much like RHL and Ubuntu do for Linux.
- Big data market survey: Hadoop solutions
- Hadoop: What it is, how it works, and what it can do
- Get started with Hadoop: From evaluation to your first production cluster
Four short links: 1 February 2012
- Cycles of Invention and Commoditisation (Simon Wardley) -- Explosions of industrial creativity rarely follow the invention or discovery of a technology but instead its commoditisation i.e. it wasn't the discovery of electricity but Edison's introduction of utility services for electricity that produced the creative boom that led to recorded music, modern movies, consumer electronics and even Silicon Valley. However, utility provision of electricity did more than just create a new world, it disrupted existing industries (both directly and through reduced barriers of entry), it also allowed for new practices and methods of working to emerge and even resulted in new economic forms - such as Henry Ford's Fordism. This isn't a one off pattern. The cycle of invention/commoditisation repeats throughout our industrial history, following a surprisingly consistent pathway. Understanding this pattern is critical to anticipating the changes emerging in our industry today - whether that's the web, cloud computing or the future changes that 3D printing will bring. Simon explains the Business of the Internet in one blog post. Simon is king.
- Why Are Software Development Task Estimations Regularly Off By A Factor of 2 or 3? -- never a truer word spoken in parable.
- Using the Full-Screen API in Browsers (Mozilla) -- useful! The older I get, the more I like full-screen mode. I found myself wishing my email client had it, then someone pointed out that was called "mutt in a shell window". Fair 'nuff.
- File Formats in Javascript (GitHub) -- pointers to libraries for different file formats in Javascript.
With GOV.UK, British government redefines the online government platform
The British Government has launched a beta of its GOV.UK platform, testing a single domain that could be used throughout government. The new single government domain will eventually replace Directgov, the UK government portal which launched back in 2004. GOV.UK is aimed squarely at delivering faster digital services to citizens through a much improved user interface at decreased cost.
Unfortunately, far too often .gov websites cost millions and don't deliver as needed. GOV.UK is open source, mobile-friendly, platform agnostic, uses HTML5, scalable, hosted in the cloud and open for feedback. Those criteria collectively embody the default for how government should approach their online efforts in the 21st century.
“Digital public services should be easy to find and simple to use - they must also be cost effective and SME-friendly," said Francis Maude, the British Minister for the Cabinet Office, in a prepared statement. "The beta release of a single domain takes us one step closer to this goal."
Tom Loosemore, deputy director of government digital service at UK Government, introduced the beta of GOV.UK at the Government Digital Service blog, including a great deal of context on its development and history. Over at the Financial Times Tech blog, Tim Bradshaw has published an excellent review of the GOV.UK beta.
As Bradshaw highlights, what's notable about the new beta is not just the site itself but the team and culture behind it: that of a large startup, not the more ponderous bureaucracy of Whitehall, the traditional "analogue" institution.
GOV.UK is a watershed in how government approaches Web design, both in terms of what you see online and how it was developed. The British team of developers, designers and managers behind the platform collaboratively built GOV.UK in-house using agile development and the kind of iterative processes one generally only sees in modern Web design shops. Given that this platform is designed to serve as a common online architecture for the government of the United Kingdom, that's meaningful.
“Our approach is changing," said Maude. "IT needs to be commissioned or rented, rather than procured in huge, expensive contracts of long duration. We are embracing new, cloud-based start-ups and enterprise companies and this will bring benefits for small and medium sized enterprises here in the UK and so contribute to growth.”
The designers of GOV.UK, in fact, specifically describe it as "government as a platform." It's code that others can build upon. It was open from the start, given that the new site was built using open source tools. The code behind GOV.UK was released as open source code on GitHub.
Things like code for @govuk being on github is a pretty big deal. Proper bold statement of a new kind of culture. Respect #gov20
-- Dominic Campbell (@dominiccampbell), January 31, 2012
"For me, this platform is all about putting the user needs first in the delivery of public services online in the UK," said Mike Bracken, executive director of government digital services. Bracken is the former director of digital development at the Guardian News and Media and was involved in setting up MySociety. "For too long, user need has been trumped by internal demands, existing technology choices and restrictive procurement practices. GOV.UK puts user need firmly in charge of all our digital thinking, and about time too."
The GOV.UK stack
Reached via email, Bracken explained more about the technology choices that went into GOV.UK, including a platform diagram.
Why create an open source stack? "Why not?" asked Bracken. "It's a government platform, and as such it belongs to us all and we want people to contribute and share in its development."
While many local, state and federal sites in the United States have chosen to adapt and use WordPress or Drupal as open government platforms, the UK team started afresh.
"Much of the code is based on our earlier alpha, which we launched in May last year as an early prototype for a single platform," said Bracken. "We learnt from the journey, and rewrote some key components recently, one key element of the prototype in scale."
According to Bracken, the budget for the beta is £1.7 million, which they are running under at present. (By way of contrast, the open government reboot of FCC.gov was estimated to cost 1.35 million dollars.) There are about 40 developers coding on GOV.UK, said Bracken, but the entire Government Digital Service has around 120 staff, with up to 1800 external testers. They also used several external development houses to complement their team, some for only two weeks at a time.
Why build an entirely new open government platform? "It works," said Bracken. "It's inherently flexible, best of breed and completely modular. And it doesn't require any software licenses."
Bracken believes that GOV.UK will give the British government agility, flexibility and freedom to change as they go, which are, as he noted, not characteristics aligned with the usual technology build in the UK -- or elsewhere, for that matter.
Given the British government's ambitious plans for open data, the GOV.UK platform also will need to act as, well, a platform. On that count, they're still planning, not implementing.
"With regard to API's, our long term plan is to 'go wholesale,' by which we mean expose data and services via API's," said Bracken. "We are at the early stages of mapping out key attributes, particularly around identity services, so to be fair it's early days yet. The inherent flexibility does allow for us to accommodate future changes, but it would be premature to make substantial claims to back up API delivery at this point."
The GOV.UK platform will be adaptable for the purposes of city government as well, over time. "We aim to migrate key department sites onto it in the first period of migration, and then look at government agencies," said Bracken. "The migration, with over 400 domains to review, will take more than a year. We aim to offer various platform services which meet the needs of all Government service providers."
Making GOV.UK citizen-centric
The GOV.UK platform was also designed to be citizen-centric, keeping the tasks that people come to a government site to accomplish in mind. Its designers, apparently amply supplied with classic British humor, dubbed the engine that tracks them the "Needotron."
"We didn't just identify top needs," said Loosemore, via email. "We built a machine to manage them for us now and in the future. Currently there are 667!" Loosemore said that they've open sourced the Needotron code, for those interested in tracking needs of their own.
"There are some of the Top needs we've not got to properly yet," said Loosemore. "For example, job search is still sub-optimal, as is the stuff to do with losing your passport."
According to Loosemore, some of the top needs that citizens have when they come to a site in the UK are determining the minimum wage, learning when the public and bank holidays are or when the clocks change for British Summer Time. They also come to central government to pay their council tax, which is actually a local function, but GOV.UK is designed to route those users to the correct site using geolocation.
This beta will have the top 1000 things you would need to do with government, said Maude, speaking at the Sunlight Foundation this week. (If that's so, there are over 300 more yet to go.)
"There's massive change needed in our approach to how to digitize what we do," he said. "Instead of locking in with a massive supplier, we need to be thinking of it the other way around. What do people need from government? Work from the outside in and redesign processes."
In his comments, Maude emphasized the importance of citizen-centricity, with respect to interfaces. We don't need to educate people on how to use a service, he said. We need to educate government on how to serve the citizen.
"Like U.S., the U.K. has a huge budget deficit," he said. "The public expects to be able to transact with government in a cheap, easy way. This enables them to do it in a cheaper, easier way, with choices. It's not about cutting 10 or 20% from the cost but how to do it for 10 or 20% of the total cost."
The tech behind GOV.UK
James Stewart, who was the tech lead on the beta of GOV.UK, recently blogged about and browser support. He emailed me the following breakdown of the rest of the technology behind GOV.UK.
Hosting and Infrastructure:
- DNS hosted by Dyn.com
- Servers are Amazon EC2 instances running Ubuntu 10.04LTS
- Email (internal alerts) sending via Amazon SES and Gmail
- Miscellaneous file storage on Amazon S3
- Jetty application server
- Nginx, Apache and mod_passenger
- Jenkins continuous integration server
- Caching by Varnish
- Configuration management using Puppet
Front end
- Javascript uses jQuery, jQuery UI, Chosen, and a variety of other plugins
- Gill Sans, provided by fonts.com
- Google web font loader
Languages, Frameworks and Plugins
"Most of the application code is written in Ruby, running on a mixture of Rails and Sinatra," said Stewart. "Rails and Sinatra gave us the right balance of productivity and clean code, and were well known to the team we've assembled. We've used a range of gems along with these, full details of which can be found in the Gemfiles at Github.com/alphagov."
The router for GOV.UK is written in Scala and uses Scalatra for its internal API, said Stewart. "The router distributes requests to the appropriate backend apps, allowing us to keep individual apps very focused on a particular problem without exposing that to visitors," said Stewart. "We did a bake-off between a ruby implementation and a Scala implementation and were convinced that the Scala version was better able to handle the high level of concurrency this app will require."
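To make the router's role concrete, here is a minimal Python sketch of distributing requests by longest matching path prefix. It is an illustration under stated assumptions, not the GOV.UK implementation: the production router is written in Scala, and the route table, the backend names, and the `pick_backend` function are all hypothetical. Prefix matching is itself an assumption; Stewart's description does not say how the router matches requests to apps.

```python
# Hypothetical route table: path prefix -> backend app name.
# None of these names are taken from the GOV.UK codebase.
ROUTES = {
    "/": "frontend",
    "/bank-holidays": "calendars",
    "/search": "search",
}


def pick_backend(path, routes=ROUTES):
    """Return the backend registered under the longest matching path prefix."""
    best = None
    for prefix in routes:
        # A prefix matches the exact path or a whole leading path segment,
        # so "/search" does not capture "/searching".
        matches = (
            prefix == "/"
            or path == prefix
            or path.startswith(prefix + "/")
        )
        if matches and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        raise LookupError(f"no backend registered for {path!r}")
    return routes[best]
```

Calling `pick_backend("/bank-holidays/2012")` selects the hypothetical `calendars` app, while any path without a more specific entry falls through to `frontend`. The sketch leaves out proxying and the concurrency handling that drove the team's choice of Scala.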
Databases
- MongoDB. "We started out building everything using MySQL but moved to MongoDB as we realised how much of our content fitted its document-centric approach," said Stewart. "Over time we've been more and more impressed with it and expect to increase our usage of it in the future."
- MySQL, hosted using Amazon's RDS platform. "Some of the data we need to store is still essentially relational and we use MySQL to store that," said Stewart. "Amazon RDS takes away many of the scaling and resilience concerns we had with that, without requiring changes to our application code."
- MaPit geocoding and information service from mySociety. "MaPit not only does conventional geocoding," said Stewart, in terms of determining the longitude and latitude for a given postcode, but "it also gives us details of all the local government areas a postcode is in, which lets us point visitors to relevant local services."
Collaboration tools
- Campfire for team chat
- Google Apps
- MediaWiki
- Pivotal Tracker
- Many, many index cards.
Related:
- White House to open source Data.gov as open government platform
- New open data initiatives in Canada and Britain
- Government IT's quiet open source evolution
Embracing the chaos of data
A data scientist and a former Apple engineer, Pete Warden (@petewarden) is now the CTO of the new travel photography startup Jetpac. Warden will be a keynote speaker at the upcoming Strata Conference, where he'll explain why we should rethink our approach to data. Specifically, rather than pursue the perfection of structured information, Warden says we should instead embrace the chaos of unstructured data. He expands on that idea in the following interview.
What do you mean by asking data scientists to embrace the chaos of data?
Pete Warden: The heart of data science is designing instruments to turn signals from the real world into actionable information. Fighting the data providers to give you those signals in a convenient form is a losing battle, so the key to success is getting comfortable with messy requirements and chaotic inputs. As an engineer, this can feel like a deal with the devil, as you have to accept error and uncertainty in your results. But the alternative is no results at all.
Are we wasting time trying to make unstructured data structured?
Pete Warden: Structured data is always better than unstructured, when you can get it. The trouble is that you can't get it. Most structured data is the result of years of effort, so it is only available with a lot of strings, either financial or through usage restrictions.
The first advantage of unstructured data is that it's widely available because the producers don't see much value in it. The second advantage is that because there's no "structuring" work required, there's usually a lot more of it, so you get much broader coverage.
A good comparison is Yahoo's highly-structured web directory versus Google's search index built on unstructured HTML soup. If you were looking for something that was covered by Yahoo, its listing was almost always superior, but there were so many possible searches that Google's broad coverage made it more useful. For example, I hear that 30% of search queries are "once in history" events — unique combinations of terms that never occur again.
Dealing with unstructured data puts the burden on the consuming application instead of the publisher of the information, so it's harder to get started, but the potential rewards are much greater.
How do you see data tools developing over the next few years? Will they become more accessible to more people?
Pete Warden: One of the key trends is the emergence of open-source projects that deal with common patterns of unstructured input data. This is important because it allows one team to solve an unstructured-to-structured conversion problem once, and then the entire world can benefit from the same solution. For example, turning street addresses into latitude/longitude positions is a tough problem that involves a lot of fuzzy textual parsing, but open-source solutions are starting to emerge.
Associated photo on home and category pages: "mess with graphviz" by Toms Bauģis, on Flickr
Related:
- Big data market survey: Hadoop solutions
- Before you interrogate data, you must tame it
- Big data goes to work
Four short links: 31 January 2012
- The Sky is Rising -- TechDirt's Mike Masnick has written (and made available for free download) an excellent report on the entertainment industry's numbers and business models. Must read if you have an opinion on SOPA et al.
- Tennis Australia Exposes Match Analytics -- Served from IBM's US-based private cloud, the updated SlamTracker web application pulls together 39 million points of data collated from all four Grand Slam tournaments over the past seven years to provide insights into a player's style of play and progress. The analytics application also provides a player's likelihood of beating their opponent through each round of the two-week tournament and the 'key to the match' required for them to win. "We gave our data to IBM, said, 'Here we go, that's 10 years of scores and stats, matches and players'," said Samir Mahir, CIO at Tennis Australia. Data as way to engage fans. (via Steve O'Grady)
- Data Monday: Logins and Passwords (Luke Wroblewski) -- Password recovery is the number one request to help desks for intranets that don’t have single sign-on portal capabilities.
- QR Codes: Bad Idea or Terrible Idea? (Kevin Marks) -- People have a problem finding your URL. You post a QR Code. Now they have 2 problems. I prefer to think of QR codes as a prototype of what Matt Jones calls "the robot-readable world"--not so much the technology we really imagine we will be deploying when we build our science fictiony future.
Four short links: 30 January 2012
- Improvisation and Forgiveness (JP Rangaswami) -- what makes us human is not repetitive action. Human occupations should require human intellect, and there's no more human activity than making a judgement call when processes have failed a customer.
- Kinect Tech in Laptop Prototypes -- "waving your hands around at your laptop" will be the new "bellowing into your walkie-talkie phone". (via Greg Linden)
- Beautiful Web Type -- demo page for the best from Google's web fonts directory. Source on GitHub.
- Ethics of Brain Boosting, Discussion (Hacker News) -- this comment in particular: in my initial reckless period of self-experimentation, I managed to induce phosphenes by accident -- blue white flashes in the entire visual field, blanking out everything else. Both contacts were in the supraorbital region. I ceased my experiments for a while and returned to the literature. And you thought that typo where you accidentally took the database offline was bad ....
A discussion with David Farber: bandwidth, cyber security, and the obsolescence of the Internet
David Farber, a veteran of Internet technology and politics, dropped by Cambridge, Mass. today and was gracious enough to grant me some time in between his numerous meetings. On leave from Carnegie Mellon, Dave still intervenes in numerous policy discussions related to the Internet and "plays in Washington," as well as hosting the popular Interesting People mailing list. This list delves into dizzying levels of detail about technological issues, but I wanted to pump him for big ideas about where the Internet is headed, topics that don't make it to the list.
How long can the Internet last?
I'll start with the most far-reaching prediction: that Internet protocols simply aren't adequate for the changes in hardware and network use that will come up in a decade or so. Dave predicts that computers will be equipped with optical connections instead of pins for networking, and the volume of data transmitted will overwhelm routers, which at best have mixed optical/electrical switching. Sensor networks, smart electrical grids, and medical applications with genetic information could all increase network loads to terabits per second.
When routers evolve to handle terabit-per-second rates, packet-switching protocols will become obsolete. The speed of light is constant, so we'll have to rethink the fundamentals of digital networking.
I tossed in the common nostrum that packet-switching was the fundamental idea behind the Internet and its key advance over earlier networks, but Dave disagreed. He said lots of activities on the Internet reproduce circuit-like behavior, such as sessions at the TCP or Web application level. So theoretically we could re-architect the underlying protocols to fit what the hardware and the applications have to offer.
But he says his generation of programmers who developed the Internet are too tired ("It's been a tough fifteen or twenty years") and will have to pass the baton to a new group of young software engineers who can think as boldly and originally as the inventors of the Internet. He did not endorse any of the current attempts to design a new network, though.
Slaying the bandwidth bottleneck
Like most Internet activists, Dave bewailed the poor state of networking in the U.S. In advanced nations elsewhere, 100-megabit-per-second networking is available for reasonable costs, whereas here it's hard to go beyond 30 megabits (on paper!) even at enormous prices and in major metropolitan areas. Furthermore, the current administration hasn't done much to improve the situation, even though candidate Obama made high bandwidth networking a part of his platform and FCC Chairman Julius Genachowski talks about it all the time.
Dave has been going to Washington on tech policy consultations for decades, and his impressions of the different administrations have a unique slant all their own. The Clinton administration really listened to staff who understood technology--Gore in particular was quite a technology junkie--and the administration's combination of judicious policy initiatives and benign neglect led to the explosion of the commercial Internet. The following Bush administration was famously indifferent to technology at best. The Obama administration lies somewhere in between in cluefulness, but despite their frequent plaudits for STEM and technological development, Dave senses that neither Obama nor Biden really has the drive to deal with and examine complex technical issues and insist on action where necessary.
I pointed out the U.S.'s particular geographic challenges--with a large, spread-out population making fiber expensive--and Dave countered that fiber to the home is not the best solution. In fact, he claims no company could make fiber pay unless it gained 75% of the local market. Instead, phone companies should string fiber to access points 100 meters or so from homes, and depend on old copper for the rest. This could deliver quite adequate bandwidth at a reasonable cost. Cable companies, he said, could also greatly increase Internet speeds.
Fixed wireless ISPs offer Internet access to thousands of communities, mostly rural ones with no other access except dial-up. These ISPs face interconnection problems because they are distrusted or ignored by the incumbent carriers. Mobile wireless companies are pretty crippled by loads that they encouraged (through the sale of app-heavy phones) and then had problems handling, and are busy trying to restrict users' bandwidth. But a combination of 4G, changes in protocols, and other innovations could improve their performance.
Waiting for the big breach
I mentioned that in the previous night's State of the Union address, Obama had made a vague reference to a cybersecurity initiative with a totally unpersuasive claim that it would protect us from attack. Dave retorted that nobody has a good definition of cybersecurity, but that this detail hasn't held back every agency with a stab at getting funds for it from putting forward a cybersecurity strategy. The Army, the Navy, Homeland Security, and others are all looking for new missions now that old ones are winding down, and cybersecurity fills the bill.
The key problem with cybersecurity is that it can't be imposed top-down, at least not on the Internet, which, in a common observation reiterated by Dave, was not designed with security in mind. If people use weak passwords (and given current password cracking speeds, just about any password is weak) and fall victim to phishing attacks, there's little we can do with diktats from the center. I made this point in an article twelve years ago. Dave also pointed out that viruses stay ahead of pattern-matching virus detection software.
Security will therefore need to be rethought drastically, as part of the new network that will replace the Internet. In the meantime, catastrophe could strike--and whoever is in the Administration at the time will have to face public wrath.
Odds without ends
We briefly discussed FCC regulation, where Farber tends to lean toward asking the government to forbear. He acknowledged the merits of arguments made by many Internet supporters, that the FCC tremendously weakened the chances for competition in 2002 when it classified cable Internet as a Title I service. This shielded the cable companies from regulations under a classification designed back in early Internet days to protect the mom-and-pop ISPs. And I pointed out that the cable companies have brazenly sued the FCC to win court rulings saying the companies can control traffic any way they choose. But Farber says there are still ways to bring in the FCC and other agencies, notably the Federal Trade Commission, to enforce anti-trust laws, and that these agencies have been willing to act to shut down noxious behavior.
Dave and I shared other concerns about the general deterioration of modern infrastructure, affecting water, electricity, traffic, public transportation, and more. An amateur pilot, Dave knows some things about the air traffic systems that make one reluctant to fly. But there are a few simple fixes. Commercial air flights are safe partly because pilots possess great sense and can land a plane even in the presence of confusing and conflicting information. On the other hand, Dave pointed out that mathematicians lack models to describe the complexity of such systems as our electrical grid. There are lots of areas for progress in data science.
Moneyball for software engineering, part 2
Brad Pitt was recently nominated for the Best Actor Oscar for his portrayal of Oakland A's general manager Billy Beane in the movie "Moneyball," which was also nominated for Best Picture. If you're not familiar with "Moneyball," the subject of the movie and the book on which it's based is Beane, who helped pioneer an approach to improve baseball team performance based on statistical analysis. Statistics were used to help find valuable players that other teams overlooked and to identify winning strategies for players and coaches that other teams missed.
Last October, when the movie was released, I wrote an article discussing how Moneyball-type statistical analysis can be applied to software teams. As a follow-up, and in honor of the recognition that the movie and Pitt are receiving, I thought it might be interesting to spend a little more time on this, focusing on the process and techniques that a manager could use to introduce metrics to a software team. If you are intrigued by the idea of tracking and studying metrics to help find ways to improve, the following are suggestions of how you can begin to apply these techniques.
Use the data that you have
The first puzzle to solve is how to get metrics for your software team. There are all kinds of things you could "measure." For example, you could track how many tasks each developer completes, the complexity of each task, the number of production bugs related to each feature, or the number of users added or lost. You could also measure less obvious activities or contributions, such as the number of times a developer gets directly involved in customer support issues or the number of times someone works after hours.
In the movie "Moneyball," there are lots of scenes showing complicated-looking statistical calculations, graphs, and equations, which makes you think that the statistics Billy Beane used were highly sophisticated and advanced. But in reality, most of the statistics that he used were very basic and were readily available to everyone else. The "innovation" was to examine the statistics more closely in order to discover key fundamentals that contributed to winning. Most teams, Beane realized, disregarded these fundamentals and failed to use them to find players with the appropriate skills. By focusing on these overlooked basics, the Oakland A's were able to gain a competitive edge.
To apply a similar technique to software teams, you don't need hard-to-gather data or complex metrics. You can start by using the data you have. In your project management system, you probably have data on the quantity and complexity of development tasks completed. In your bug-tracking and customer support systems, you probably have data on the quantity, rate, and severity of product issues. These systems also typically have simple reporting or export mechanisms that make the data easy to access and use.
Looking at this type of data and the trends from iteration to iteration is a good place to start. Most teams don't devote time to examining historical trends and various breakdowns of the data by individual and category. For each individual and the team as a whole, you can look at all your metrics, focusing, at least to start, on fundamentals like productivity and quality. This means examining the history of data captured in your project management, bug tracking, and customer support systems. As you accumulate data over time, you can also analyze how the more recent metrics compare to those in the past. Gathering and regularly examining fundamental engineering and quality metrics is the first step in developing a new process for improving your team.
Establish internal support
If you saw the movie "Moneyball," you know that much was made of the fact that some of the old experienced veterans had a hard time getting on board with Beane and his new-fangled ideas. The fact that Beane had statistics to back up his viewpoints didn't matter. The veterans didn't get it, and they didn't want to. They were comfortable relying on experience and the way things were already done. Some of them saw the new ideas and approaches, and the young guys who were touting them, as a threat.
If you start talking about gathering and showing metrics, especially individual metrics that might reveal how one person compares to another, some people will likely have concerns. One way to avoid serious backlash, of course, is to move slowly and gradually. The other thing you might do to decrease any negative reaction is to cultivate internal supporters. If you can get one, two, or a few team members on board with the idea of reviewing metrics regularly as a way to identify areas for improvement, they can be a big help to allay the fears of others, if such fears arise.
How do you garner support? Try sitting down individually with as many team members as possible to explain what you hope to do. It's the kind of discussion you might have informally over lunch. If you are the team manager, you'll want to explain carefully that historical analysis of metrics isn't designed to assign blame or grade performance. The goal is to spend more time examining the past in the hopes of finding incremental ways to improve. Rather than just reviewing the past through memory or anecdotes, you hope to get more accuracy and keener insights by examining the data.
After you talk with team members individually, you'll have a sense for who's supportive, who's ambivalent, and who's concerned. If you have a majority of people who are concerned and a lack of supporters, then you might want to rethink your plan and try to allay concerns before you even start.
Once you begin gathering and reviewing metrics as a team, it's a good idea to go back and check in periodically with both the supporters and the naysayers, either individually or in groups. You should get their reactions and input on suggestions to improve. If support is going down and concern is going up, then you'll need to make adjustments, or your use of metrics is headed for failure. If support is going up and concern is going down, then you are on the right track. Beane didn't get everyone to buy into his approach, but he did get enough internal supporters to give him a chance, and more were converted once they saw the results.
Codermetrics: Analytics for Improving Software Teams. This concise book introduces codermetrics, a clear and objective way to identify, analyze, and discuss the successes and failures of software engineers, not as part of a performance review, but as a way to make the team a more cohesive and productive unit.
Embed metrics in your process
While there might be some benefit to gathering and looking at historical metrics on your own, to gain greater benefits, you'll want to share metrics with your team. The best time to share and review metrics is in meetings that are part of your regular review and planning process. You can simply add an extra element to those meetings to review metrics. Planning or review meetings that occur on a monthly or quarterly basis, for example, are appropriate forums. Too-frequent reviews, however, may become repetitive and wasteful. If you have biweekly review and planning meetings, for example, you might choose to review metrics every other meeting rather than every time.
To make the review of metrics effective and efficient, you can prepare the data for presentation, possibly summarizing key metrics into spreadsheets and graphs, or a small number of presentation slides (examples and resources for metric presentation can be found and shared at codermetrics.org). You will want to show your team summary data and individual breakdowns for the metrics gathered. For example, if you are looking at productivity metrics, then you might look at data such as:
- The number and complexity of tasks completed in each development iteration.
- A breakdown of completed task counts, grouped by complexity.
- The total number of tasks and sum of complexity completed by each engineer.
- The trend of task counts and complexity over multiple development iterations.
- A comparison of the most recent iteration to the average, highs, and lows of past iterations.
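As a rough illustration of the summaries listed above, here is a minimal Python sketch. The task records, the engineer names, the 1-3 complexity scale, and all function names are assumptions made up for this example; they are not taken from the book or from any particular tracking tool.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical task records: (iteration, engineer, complexity on a 1-3 scale).
TASKS = [
    (1, "ana", 1), (1, "ana", 3), (1, "ben", 2),
    (2, "ana", 2), (2, "ben", 2), (2, "ben", 1), (2, "ben", 3),
    (3, "ana", 3), (3, "ben", 1),
]


def _totals(tasks, key_index):
    """Group tasks by one field and return {key: (task_count, complexity_sum)}."""
    totals = defaultdict(lambda: [0, 0])
    for task in tasks:
        entry = totals[task[key_index]]
        entry[0] += 1        # one more completed task
        entry[1] += task[2]  # add this task's complexity
    return {key: tuple(value) for key, value in sorted(totals.items())}


def by_iteration(tasks):
    """Number and total complexity of tasks completed in each iteration."""
    return _totals(tasks, 0)


def by_engineer(tasks):
    """Number and total complexity of tasks completed by each engineer."""
    return _totals(tasks, 1)


def counts_by_complexity(tasks):
    """How many completed tasks fall into each complexity level."""
    return dict(sorted(Counter(task[2] for task in tasks).items()))


def compare_latest(per_iteration):
    """Compare the latest iteration's complexity total with past iterations.

    Needs at least two iterations, so that there is a past to compare against.
    """
    iterations = sorted(per_iteration)
    past = [per_iteration[i][1] for i in iterations[:-1]]
    return {
        "latest": per_iteration[iterations[-1]][1],
        "past_average": mean(past),
        "past_high": max(past),
        "past_low": min(past),
    }


if __name__ == "__main__":
    per_iteration = by_iteration(TASKS)
    print("Per iteration:", per_iteration)  # the trend is this dict read in order
    print("Per engineer:", by_engineer(TASKS))
    print("By complexity:", counts_by_complexity(TASKS))
    print("Latest vs. past:", compare_latest(per_iteration))
```

The output of functions like these could be pasted into a spreadsheet or chart for the review meeting; the point is only that each item in the list reduces to a simple grouping of the same task records.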
To start, you are just trying to get the team in the habit of looking at recent and historical data more closely, and it's not necessary to have a specific intent defined. The goal, when you begin, is to just "see what you can see." Present the data, and then foster observations and open discussion. As the team examines its metrics, especially over the course of time, patterns may emerge. Individual and team ideas about what the metrics reveal and potential areas of improvement may form. Opinions about the usefulness of specific data or suggestions for new types of metrics may come out.
Software developers are smart people. They are problem-spotters and problem-solvers by nature. Looking at the data from what they have done and the various outcomes is like looking at the diagnostics of a software program. If problems or inefficiencies exist, it is likely the team or certain individuals will spot them. In the same way that engineers fix bugs or tune programs, as they more closely analyze their own metrics, they may identify ways to tune their performance for the better.
There's a montage in the middle of the movie "Moneyball" where Beane and his assistant are interacting with the baseball players. It's my favorite part of the movie. They are sharing their statistics-inspired ideas of how games are won and lost, and making small suggestions about how the players can improve. Albeit briefly, we see in the movie that the players themselves begin to figure it out. Beane, his assistant, the coaches and the players are all a team. Knowledge is found, shared, and internalized. As you incorporate metrics-review and metrics-analysis into your development activities, you may see a similar organic process of understanding and evolution take place.
Set short-term, reasonable goals
Small improvements and adjustments can be significant. In baseball, one or two runs can be the difference between a win and a loss, and a few wins over the course of a long season can be the difference between second place and first. On a software team, a 2% productivity improvement equates to just 10 minutes "gained" per eight-hour workday, but that translates to an "extra" week of coding for every developer each year. The larger the team, the more those small improvements add up.
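The arithmetic behind the 2% figure can be checked with a few lines of Python. This is only a sketch: the 50 working weeks per year is an assumption introduced here (the article does not state one), and the function names are hypothetical.

```python
WORKDAY_HOURS = 8          # eight-hour workday, as in the text
WORK_WEEKS_PER_YEAR = 50   # assumption: roughly 50 working weeks per year


def minutes_gained_per_day(improvement, workday_hours=WORKDAY_HOURS):
    """Minutes "gained" per workday for a fractional productivity improvement."""
    return improvement * workday_hours * 60


def weeks_gained_per_year(improvement, work_weeks=WORK_WEEKS_PER_YEAR):
    """Equivalent "extra" working weeks per developer per year."""
    return improvement * work_weeks


def team_weeks_gained(improvement, team_size, work_weeks=WORK_WEEKS_PER_YEAR):
    """The same gain summed across a team of the given size."""
    return team_size * weeks_gained_per_year(improvement, work_weeks)


if __name__ == "__main__":
    print(minutes_gained_per_day(0.02))   # about 9.6 minutes, i.e. roughly 10
    print(weeks_gained_per_year(0.02))    # about 1 week per developer
    print(team_weeks_gained(0.02, 10))    # about 10 weeks for a team of ten
```

Under these assumptions, 2% of an eight-hour day is about 9.6 minutes, which the text rounds to 10, and 2% of 50 working weeks is one week per developer.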
Once you begin to keep and review metrics regularly, the next step is to identify areas that you believe can be improved. There is no rush to do this. You might, for example, share and review metrics as a team for many months before you begin to discuss specific areas that might be improved. Over time, having reviewed their contributions and outcomes more closely, certain individuals may themselves begin to see ways to improve. For example, an engineer whose productivity is inconsistent may realize a way to become more consistent. Or the team may realize there are group goals they'd like to achieve. If, for example, regular examination makes everyone realize that the rate of new production bugs found matches or exceeds the rate of bugs being fixed, the team might decide they'd like to be more focused on turning that trend around.
It's fine — maybe even better — to target the easy wins to start. It gets the team going and allows you to test and demonstrate the potential usefulness of metrics in setting and achieving improvement goals. Later, you can extend and apply these techniques to other areas for different, and possibly more challenging, types of improvements.
When you have identified an area for improvement, either for an individual or a group, you can identify the associated metric and the target goal. Pick a reasonable goal, especially when you are first testing this process, remembering that small incremental improvements can still have significant effects. Once the goal is set, you can use your metrics to track progress month by month.
To summarize, the simple process for employing metrics to make improvements is:
- Gather and review historical metrics for a specific area.
- Set metrics-based goals for improvement in that area.
- Track the metrics at regular intervals to show progress toward the goal.
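The three steps above can be sketched in Python as follows. Everything here is illustrative: the `Goal` class, the helper name, and the bug-backlog numbers are assumptions invented for the example, and the 10% reduction is simply one plausible small target.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """A metrics-based improvement goal for a single area."""
    metric: str
    baseline: float  # where the metric stood when the goal was set
    target: float    # where the team wants it to be

    def progress(self, current):
        """Fraction of the way from baseline to target (0.0 = none, 1.0 = reached).

        Works for metrics that should go down (bug backlog) or up (tasks completed).
        """
        span = self.target - self.baseline
        if span == 0:
            return 1.0
        return (current - self.baseline) / span

    def is_met(self, current):
        """True once the current reading has reached or passed the target."""
        if self.target < self.baseline:
            return current <= self.target
        return current >= self.target


def baseline_from_history(history):
    """Step 1: summarize historical readings into a baseline (here, the average)."""
    return sum(history) / len(history)


if __name__ == "__main__":
    # Step 1: gather and review history (hypothetical month-end open bug counts).
    baseline = baseline_from_history([40, 42, 38])

    # Step 2: set a small, reasonable goal: a 10% reduction from the baseline.
    goal = Goal("open production bugs", baseline, baseline - baseline * 0.10)

    # Step 3: track the metric at regular intervals (hypothetical monthly readings).
    for month, reading in enumerate([39, 37, 36], start=1):
        print(f"month {month}: {reading} open, "
              f"{goal.progress(reading):.0%} of the way, met={goal.is_met(reading)}")
```

Keeping the goal as a small record of metric, baseline, and target makes it easy to show progress in the same regular meetings where the metrics are already reviewed.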
The other thing to keep in mind when getting started is that it's best to focus on goals that can be achieved quickly. Like any test case, you want to see results early to know if it's working. If you target areas that can show improvement in less than three months, for example, then you can evaluate more quickly whether utilizing metrics is helpful or not. If the process works, then these early and easier wins can help build support for longer-term experiments.
Take one metric at a time
It pays to look at one metric at a time. Again, this is similar to tuning a software program. In that case, you instrument the code or implement other techniques to gather performance metrics and identify key bottlenecks. Once the improvable areas are identified, you work on them one at a time, tuning the code and then testing the results. When one area is completed, you move on to the next.
Focusing the team and individuals on one key metric and one area at a time allows everyone to apply their best effort to improve that area. As with anything else, if you give people too many goals, you run the risk of making it harder to achieve any of the goals, and you also make it harder to pinpoint the cause of failure should that occur.
If you are just starting with metrics, you might have the whole team focus on the same metric and goal. But over time you can have individuals working on different areas with separate metrics, as long as each person is focused on one area at a time. For example, some engineers might be working to improve their personal productivity while others are working to improve their quality.
Once an area is "tuned" and improvement goals are reached, you'll want to continue reviewing metrics to make sure you don't fall back. Then you can move on to something else.
Build on small successes
Let's say that you begin reviewing metrics on production bug counts or development productivity per iteration; then you set some small improvement targets; and after a time, you reach and sustain those goals. Maybe, for example, you reduce a backlog of production bugs by 10%. Maybe this came through extra effort for a short period of time, but at the end, the team determines that metrics helped. Perhaps the metrics helped increase everyone's understanding of the "problem" and helped maintain a focus on the goal and results.
While this is a fairly trivial example, even a small success like this can help as a first step toward more. If you obtain increased support for metrics, and hopefully some proof of the value, then you are in a great position to gradually expand the metrics you gather and use.
In the long run, the areas that you can measure and analyze go well beyond the trivial. For example, you might expand beyond core software development tasks and skills, beyond productivity and quality, to begin to look at areas like innovation, communication skills, or poise under pressure. Clearly, measuring such areas takes much more thought and effort. To get there, you can build on small, incremental successes using metrics along the way. In so doing, you will not only be embedding metrics-driven analysis in your engineering process, but also in your software development culture. This can extend into other important areas, too, such as how you target and evaluate potential recruits.
Moneyball-type techniques are applicable to small and big software teams alike. They can apply in organizations that are highly successful as well as those just starting out. Bigger teams and larger organizations can sometimes afford to be less efficient, but most can't, and smaller teams certainly don't have this luxury. Beane's great success was making his organization highly competitive while spending far less money (hence the term "Moneyball"). To do this, his team had to be smarter and more efficient. It's a goal to which we can all aspire.
Jonathan Alexander looked at the connection between Moneyball and software teams in the following webcast:
Photo: scoreboard by popofatticus, on Flickr
Related:
- Moneyball for software engineering, part 1
- Data and baseball (and Brad Pitt)
- Process kills developer passion
