Due to my father’s influence, I have a longstanding interest in the intersection of science, teaching, and technology. Dad has been exploring the concept of Big Data for a little while now, and when I saw a hearing on the topic scheduled for the House side, I decided to go.
While the cast of characters assembled, I had time to reflect on the differences between a Senate hearing and a House hearing. The most obvious is that because House committees are so much larger than Senate committees, in a House hearing room the Members of Congress sit in what I might think of as stadium or lecture hall seating, with chairs on progressively higher levels. In most Senate hearing rooms, the senators sit in some kind of horseshoe arrangement, which would seem to facilitate interactions. The other difference was a Congresswoman who appeared in an iridescent pink plastic cowboy hat. Considering that her outfit complemented her headgear, I had to assume that the hat was not an impulsive choice. We just don’t have pink plastic cowboy hats on the Senate side.
The three speakers for the hearing were from IBM, North Carolina State, and NSF. The man from NSF was apparently a regular witness for this subcommittee, and he was greeted as an old friend by the Members. He also won over my heart by being the only person during the entire hearing who, I noticed, consistently and correctly used data as a plural word. It took me about nine months of my postdoc to learn that “data are” is correct and “data is” is not, and it is now so ingrained that I notice whenever someone does not share my hard-won lesson. I accept that the only reason I understand the distinction is my scientific training, but I do think more of scientists who get it right.
In the opening statements by the Members, they pointed out that 90% of the data in the world were collected in the past two years. Those data are not always in a convenient format either, since email and video count as data. Enormous data sets can be harnessed to attack problems such as reducing traffic congestion, predicting natural disasters, and ensuring public security. Most recently, the pictures and videos taken by people on the street and by cameras in the air were all used to get clear pictures of the Boston Marathon bombers. Big Data sets can be used to explore the galaxy or to advance the frontiers of medicine and other sciences. As with any new advance, though, there are inevitably a number of challenges in moving forward.
One of the Members, a self-professed data nerd, asked if the bottlenecks were in hardware or in workforce, and the answer was that there are bottlenecks in both areas.
In the next decade, over half of all the jobs in Science, Technology, Engineering, and Mathematics (STEM) will be in information technology, but that does not mean that only computer science majors will find work. The fundamental skills required for data analysis are common across the STEM disciplines: math and statistics are important, but so are communication skills and teamwork. Especially critical is the ability to apply the results to help solve a problem. Indeed, the biggest target for training is building master’s degree programs, since they teach how to understand the results and impacts of an analysis beyond the basic information technology skills. People with doctorates will develop the tools for analysis, but it will be people with master’s degrees who develop the skills to use those tools most effectively.
STEM education was a big part of President Obama’s budget, and there is an ongoing understanding that one of the biggest challenges is what is known as a leaky pipeline. Particularly in middle school and high school, students lose their interest in science and mathematics and choose other careers instead. The pipeline for women leaks at every level: more women than men leave to pursue other careers, so finding ways to fix these leaks is a vital long-term strategy for ensuring that there is a well-trained workforce available to fill these jobs.
When dealing with hardware, the challenges of computation (performing operations) and the curation or management of data are considered two sides of the same coin. These issues get into the concept of exascale computing, which relates to the power of the system. For perspective, our biggest, fastest computers currently work at the petascale level (on the order of a quadrillion operations per second), and exascale is 1,000 times faster than that. There is currently a bill in the House, and there will soon be a companion bill in the Senate, requesting funding to build an exascale computer to ensure that the US retains its competitive advantage in this area.
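For readers who, like me, find the prefixes slippery: peta and exa are standard SI prefixes (10^15 and 10^18), so the factor of 1,000 is just arithmetic. A quick back-of-the-envelope sketch in Python:

```python
# SI prefixes for operations per second (FLOPS):
PETA = 10**15  # petascale: ~1 quadrillion operations per second
EXA = 10**18   # exascale:  ~1 quintillion operations per second

speedup = EXA // PETA
print(speedup)  # 1000: an exascale machine is 1000x a petascale one

# A job that takes an exascale machine one hour would keep a
# petascale machine busy for roughly 42 days.
petascale_hours = 1 * speedup
print(petascale_hours / 24)  # about 41.7 days
```

This is only a scale comparison, of course; real machines differ in memory, interconnects, and power draw, which is exactly what the hardware witnesses were pointing to.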
When the speakers discussed hardware challenges, I realized that mentally I had simply been imagining a computer in a larger box, or a smaller box, or a differently shaped box. The man from IBM, however, clearly went inside the box and pointed to the specific systems that were preventing the computer from working any faster. He talked about heat dissipation, the need for enormous amounts of electricity, and several other details that flew by too quickly for me to catch, but he could obviously nail down the specific systems that needed work. He went on to explain that advances tend to occur when a very big and very important project hits a mission-critical bottleneck caused by the inability of a computer to perform a specific function or at a specific speed. If the project is important enough, resources are thrown at the problem until someone comes up with a solution. Inevitably there turn out to be numerous other applications for that solution, but it takes that initial strong pressure to make progress.
Last year, President Obama announced a Big Data Analytics initiative, so we can expect to hear more on this topic. As an indication of how quickly the field is moving, my word processor did not recognize the word “petascale” much less “exascale.” Or perhaps it is simply pouting because it doesn’t have that amount of power.