When David Kirkby and Daniel Margala entered a contest to find out who could most accurately map dark matter in the universe, the first algorithm they submitted failed to crack the top 10 among a field of more than 100 engineers and data scientists taking part in the nearly three-month competition.
"It's quite intimidating to see a long list of competitors," said Kirkby, a cosmology professor at the University of California, Irvine. "It's a good motivator to see how you're stacking up."
Kirkby and Margala, a grad student at UC Irvine, kept tweaking and resubmitting their statistical model, 15 more times in fact, before emerging as the winners of the Mapping Dark Matter competition, sponsored by NASA and the Royal Astronomical Society and hosted by Kaggle.com, a crowdsourcing platform for data modeling and prediction competitions that is in the process of raising venture capital.
The competition worked like this:
Contestants were provided with a data set to analyze — simulated images of 100,000 galaxies — and asked to create statistical models that could be used to measure the tiny distortions in galaxy images as their light is bent by dark matter. Dark matter is the still-mysterious substance scientists believe makes up about a quarter of the universe, yet it is largely invisible.
The results were tabulated by comparing the predictive models to actual data that had been withheld from the participants. In essence, contestants were trying to fill in the missing pieces of the puzzle as accurately as possible. They submitted a total of 760 entries, most of them resubmitting numerous times as they refined their models.
Kirkby and Margala’s team named themselves “DeepZot” — a merger of Deep Thought, the fictional computer from Douglas Adams’ Hitchhiker’s Guide to the Galaxy, and “Zot!” the battle cry of the UC Irvine mascot Peter the Anteater.
Their key insight was that they could come up with increasingly accurate statistical models by developing an artificial neural network — basically, a computer brain — and teaching it to discern patterns in the galaxy images. Without that neural network, the best the team could have done was eighth place, said Margala.
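The article doesn't describe DeepZot's actual network, so the following is only a toy sketch of the general idea: a small feed-forward network, trained by gradient descent, learning to regress a target value (a stand-in for ellipticity) from a handful of image-derived features. All names, sizes, and numbers here are illustrative assumptions, not the team's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in training set: 500 "galaxies", each reduced to 4 features
# (in practice these might be pixel statistics or image moments).
X = rng.normal(size=(500, 4))
true_w = np.array([0.5, -0.3, 0.2, 0.1])
y = np.tanh(X @ true_w)[:, None]        # synthetic target "ellipticity"

# One hidden layer of 16 tanh units, trained by plain gradient descent.
W1 = rng.normal(scale=0.1, size=(4, 16))
W2 = rng.normal(scale=0.1, size=(16, 1))

for _ in range(2000):
    h = np.tanh(X @ W1)                 # hidden-layer activations
    err = h @ W2 - y                    # prediction error
    # Backpropagate the mean-squared-error gradient through both layers.
    gW2 = h.T @ err * (2 / len(X))
    gh = (err @ W2.T) * (1 - h ** 2)
    gW1 = X.T @ gh * (2 / len(X))
    W2 -= 0.2 * gW2
    W1 -= 0.2 * gW1

mse = float(((np.tanh(X @ W1) @ W2 - y) ** 2).mean())
```

The point is only that the network discovers the input-output pattern from examples rather than from a hand-written formula, which is the property the team exploited on the galaxy images.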
The final days of the contest period got the competitive juices flowing, thanks to the way it was structured. Participants could watch a real-time leaderboard that changed with each new entry, so the team in the top spot constantly checked to make sure they weren't supplanted by a better predictive model.
Another participant, Martin O’Leary, a glaciologist at Cambridge University in the U.K., used his experience analyzing satellite imagery of glaciers to develop an algorithm that put him atop the leaderboard early in the competition — and merited a post on the White House’s Office of Science and Technology Policy blog that said O’Leary had “crafted an algorithm that outperformed the state-of-the-art algorithms most commonly used in astronomy for mapping dark matter.”
"One morning I checked and saw someone had beaten me," he said. "Your competitive instincts definitely kick in."
As the hours until the contest deadline wound down, Kirkby, with DeepZot in first place, was on a backpacking trip in the Sierra Nevada. The first thing he checked when he got back to his car was whether their submission still held down the top spot. It had.
The prize? An all-expenses-paid trip to a conference at the California Institute of Technology (Caltech) in Pasadena, where the winners will present their solutions to NASA and other agencies, which will work to implement them.
Though some competitions hosted by Kaggle have sizable cash prizes, for many participants the prize is secondary to the chance to solve interesting problems that are structured in a clear, well-defined way.
"It's something that people can easily compete on," O'Leary said of the Kaggle contest. "There's lots of work that goes into taking a big problem and turning it into something manageable."
While Big Data has become a buzzword in the tech world, Kirkby says the most significant thing about such competitions isn’t raw computing power. The real significance, he says, is that they attract people with a wide variety of interests and skill sets — cosmologists who teach artificial brains to recognize patterns, glaciologists who look at galaxies and see similarities to huge chunks of ice — to chip away at problems outside their immediate expertise.
"Analysis of large data sets is a science that transcends any one problem," Kirkby said. "More brainpower is what's being leveraged by competitions like this. It's not just about access to the best computers."
COMMENTARY: Kaggle is perfectly suited to solving analytical and statistical problems. The competition format and prizes allow the widest possible participation among a galaxy of very talented computer science, mathematics, statistics and engineering wunderkinds from academia and private industry. It is a most interesting crowdsourcing technique for solving the most unusual scientific problems. Now solve how much dark matter there is in the universe. Or, how about, how large the universe is?
The Dark Matter Problem
The universe isn't behaving. Or at least, that's the view of many of the world's leading scientists: the universe behaves as if there is far more matter than we can observe. And that's important, because it means either that vital scientific theories are wrong, or that there are whole new types of stuff that we haven't yet discovered.
Mapping Dark Matter is an image analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter.
The aim is to measure the shapes of galaxies in order to reconstruct the gravitational lensing signal in the presence of noise and a known Point Spread Function. The signal is a very small change in the galaxies' ellipticity: an exactly circular galaxy image would be distorted into an ellipse. Real galaxies, however, are not circular to begin with.
The challenge is to measure the ellipticity of 100,000 simulated galaxies.
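To make "measuring ellipticity" concrete, here is a deliberately simple classical baseline (an illustrative sketch, not any competitor's actual method): estimate the two ellipticity components from the image's second brightness moments.

```python
import numpy as np

def ellipticity(img):
    """Estimate (e1, e2) from the second brightness moments of a
    postage-stamp image: e1 measures stretch along the x/y axes,
    e2 measures stretch along the diagonals."""
    img = np.asarray(img, dtype=float)
    ys, xs = np.indices(img.shape)
    total = img.sum()
    # Flux-weighted centroid.
    xbar = (img * xs).sum() / total
    ybar = (img * ys).sum() / total
    # Second moments about the centroid.
    qxx = (img * (xs - xbar) ** 2).sum() / total
    qyy = (img * (ys - ybar) ** 2).sum() / total
    qxy = (img * (xs - xbar) * (ys - ybar)).sum() / total
    denom = qxx + qyy
    return (qxx - qyy) / denom, 2.0 * qxy / denom
```

A perfectly round blob gives e1 = e2 = 0; stretching it along the x axis drives e1 positive. On the competition's real images, noise and the blurring kernel make such a naive estimator badly biased, which is exactly the problem the contest asked participants to solve.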
The data consist of:
Galaxy images: very noisy images of elliptical objects with a simple brightness profile. Each galaxy image has been convolved (smoothed) with a kernel that would turn a single point of light into a blurry image; part of the challenge is to remove or account for that blurring effect.
Star images: to help account for the blurring, each galaxy image is paired with a star image, a pixelised version of the kernel with which the galaxy image was convolved.
Participants are provided with 100,000 galaxy-and-star pairs and should provide an ellipticity estimate for each galaxy.
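To make the convolution step concrete, here is a minimal sketch (the function name and array shapes are illustrative assumptions) of blurring a galaxy image with a pixelised PSF kernel using FFTs, the standard way such simulated images are generated:

```python
import numpy as np

def convolve_with_psf(galaxy, psf):
    """Blur a galaxy image with a pixelised PSF kernel via FFTs
    (circular convolution), mimicking how the competition's
    simulated images were produced."""
    # Zero-pad the kernel to the galaxy's shape, then shift its
    # centre to pixel (0, 0) so the convolution doesn't translate
    # the image.
    kernel = np.zeros_like(np.asarray(galaxy, dtype=float))
    ky, kx = psf.shape
    kernel[:ky, :kx] = psf
    kernel = np.roll(kernel, (-(ky // 2), -(kx // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(galaxy) * np.fft.fft2(kernel)))
```

If the kernel sums to one, total flux is preserved while the peak brightness drops: the light is spread out, which is exactly the effect participants had to undo when estimating ellipticity.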
The Dark Matter Competitors
73 teams competed in the Dark Matter competition, and the results are breathtaking. Within ten days, Martin O’Leary, a PhD student in glaciology from Cambridge University, made a breakthrough on the problem. His findings were then written up on the White House blog. O’Leary’s glaciology research involves detecting edges in glacier fronts from satellite images. This seems like an unlikely background, but it is illustrative of why competitions are successful: They encourage people who would normally focus on specific problems in one field to apply their techniques to analogous problems in new fields.
O’Leary didn’t remain in the lead for long. Within a few days, Marius Cobzarenco, a graduate student in computer vision from University College London, overtook him. Less than a day later, a team made up of Eu Jin Lok, an Australian graduate student at Deloitte, and Ali Hassaine, a signature verification specialist from Qatar University, took the lead. This leapfrogging continued until cosmologists David Kirkby and Daniel Margala claimed the prize on August 18.
Leapfrogging is a consistent feature of data competitions. When a competitor makes a breakthrough, the knowledge of what is possible inspires others to repeat and improve on that breakthrough. New York University’s Arun Sundararajan calls this the ‘Roger Bannister effect,’ named after the first man to break the four-minute mile. Prior to Roger Bannister, the four-minute mile was thought to be medically impossible. For ten years, the world record for the mile was four minutes and one second, until Roger Bannister broke through the record in 1954. Six weeks later, John Landy broke it again, and it soon became normal for male middle-distance runners.
The Royal Astronomical Society’s Tom Kitching is delighted with the results of the competition:
"Every couple of years since 2005, cosmologists have come together to discuss how we crack the challenge of measuring the gravitational lensing effect. This year, however, is different. We are bringing together a collection of experts from fields as diverse as handwriting recognition to string theorists. In the few months since these competitions were launched, we have seen new methods tried, new research directions opened, and a factor 3 increase in the accuracy with which the gravitational lensing signal can be measured."
When asked what this means for the future of cosmological research, Kitching replied that ‘the meeting in Pasadena is the beginning of a new way of doing research development in cosmology, linking up diverse experts with cosmologists.’ The results are already speaking for themselves.
About Kaggle
Kaggle is an innovative solution for statistical/analytics outsourcing. Kaggle is the leading platform for data modeling and prediction competitions. Companies, governments and researchers present datasets and problems - the world's best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.
The motivation behind Kaggle is simple.
First, it is almost never the case that any single organization has access to the advanced machine learning and statistical techniques that would allow them to extract maximum value from their data. Meanwhile, data scientists crave real-world data to develop and refine their techniques. Kaggle corrects this mismatch by offering companies a cost-effective way to harness the 'cognitive surplus' of the world's best data scientists.
Second, crowdsourced data modeling is particularly effective because there are any number of approaches that can be applied to any modeling problem, and it is impossible to know at the outset which technique will be most effective. By exposing the problem to a wide audience, with different participants trying different techniques, competitions can very quickly reach the frontier of what's possible with a given dataset.
Our community of data scientists comprises thousands of PhDs from quantitative fields such as computer science, statistics, econometrics, maths and physics. They come from over 100 countries and 200 universities. In addition to the prize money and data, they use Kaggle to meet, network and collaborate with experts from related fields.
The result for our clients is cheaper, faster and more powerful analytics.
Kaggle is proud to have achieved extraordinary results that have outperformed betting markets and advanced the state of the art in HIV research and chess ratings.