Regina Barzilay: Venturesome and Voracious for Data
Profile of a scientist, teacher, working parent and cancer survivor who speaks three languages, won the MacArthur “genius grant” and pushes the envelope of artificial intelligence
Computers and their programmers are data-hungry. And nobody is more data-hungry than a computer scientist working in artificial intelligence and its subfield, machine learning, the performance of which depends on huge amounts of data. For those working in natural language processing -- the kind of programming that enables a human and a machine to communicate in words rather than the truncated language of codes or the deeper layers of numbers -- data becomes even more desirable.
Regina Barzilay, a professor of electrical engineering and computer science at the Massachusetts Institute of Technology, is a rising star in natural language processing. Last October Barzilay received a MacArthur Fellowship -- a “genius grant” -- for her contributions to computational linguistics. This added to a growing list of awards she has received, including a National Science Foundation CAREER Award and a Microsoft Faculty Fellowship.
Barzilay received her doctorate from Columbia University and joined the MIT faculty in 2003. Her dissertation focused on getting a computer to extract enough information from news reports to create an accurate and readable digest.
“You want to create lots of cases from which the machine can understand how you’re thinking,” Barzilay said. This is because machines need as many examples of language use, grammar, syntax, and semantics as they can get to infer linguistic rules and be able to talk back to a person intelligibly.
Barzilay’s interest in cutting-edge research on natural language processing may benefit from her personal linguistic adventures. Her first language was Russian; her second, Hebrew; and her third, English. Not only are these languages very different from each other structurally, but each also has its own alphabet.
Her linguistic experience also exemplifies her ability to adapt to circumstances, embrace challenges, and transcend adversity. Born in the landlocked eastern European country of Moldova, which in 1970 was part of the USSR, Barzilay grew up in a tight-knit Jewish community. She was 19 when the Soviet Union broke up. The dissolution created dangerous political instability, which led many Moldavian Jews to emigrate to Israel. Barzilay’s family saw its social network melt away as other families departed. Barzilay remembers having to cross out more and more names in her telephone notebook as friends left. Eventually her parents decided to follow them, moving to the Tel Aviv area.
Barzilay went almost immediately to a kibbutz, where she alternated between working in an electrical factory and harvesting almonds. It was perfect timing for her first experience living apart from her parents and a rapid immersion in Israeli culture. Shortly thereafter she enrolled at Ben-Gurion University of the Negev.
During that period, she recalls, she barely scraped by.
“I didn’t buy clothes for years,” she said. “I remember going once on a date and the guy bought me chocolate milk in a package, and I thought he must be really rich, to buy chocolate milk in a package.”
Barzilay earned a master’s degree in mathematics and then taught secondary school math. “It was fun the first year, all right the second year, but then each year I would be teaching from the same textbook,” she said. “I thought, not for me.” This prompted her to return to the university for a master’s degree in computer programming.
Barzilay entered the field of artificial intelligence just as one paradigm hit a wall and another emerged. The first approach to natural language processing, Barzilay said, was to try to translate everything humans understand about language into rules the machine could follow. “It didn’t work,” she said. “In the 1970s and 1980s the whole field was very symbolic, a lot of logic. I said, you know what? I don’t care. I’m going to use frequency” -- that is, numerical patterns of word use. These days natural language processing researchers are taking a more computational, statistical approach in which the machine itself has to identify patterns and infer the rules.
The process is not necessarily about handing the computer a complete guide, said Adam Fisch, an MIT doctoral student working with Barzilay. “A lot of machine learning systems take as input the raw words, and you put the onus on the machine to be able to learn the structure on its own,” Fisch said.
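The frequency-based idea described above can be illustrated with a minimal sketch. This is not Barzilay's or Fisch's code; it is a toy bigram counter, written in Python, that is handed only raw words and infers from counts alone which word tends to follow another. The function names and the three-sentence corpus are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows another in raw, unannotated text."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that comes right after it.
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows

def most_likely_next(follows, word):
    """Return the word most frequently seen after `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Toy corpus, invented purely for illustration.
corpus = [
    "the doctor reads the report",
    "the doctor writes the report",
    "the doctor reads the chart",
]
model = train_bigrams(corpus)
print(most_likely_next(model, "doctor"))
```

No grammar rule is written down anywhere in the sketch; whatever structure the program "knows" comes from how often words appear together, which is why such systems improve as they see more text.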
Although the method still doesn’t produce a computer with the brain power of, say, HAL in the film 2001: A Space Odyssey, we do have Siri and Alexa and may soon have self-driving cars. The searching and inferring that machines are currently capable of are also useful for boilerplate reports such as transaction summaries for financial institutions, Barzilay said. “They are very schematic. There is no prose there,” she said. “You fill in the template and you fill in the text. You don’t need to be a human.” On the other hand, she said, “If you are really trying to write a creative piece we are nowhere close.”
Barzilay, her colleagues and graduate students have set their sights on bigger challenges, including how a machine can answer questions with less than the ideal amount of data, which is often the case in real-world problems. Humans -- and a few machines, such as IBM’s Watson program -- are pretty good at this, as when Wheel of Fortune viewers can grasp a full phrase when only a few letters have been filled in.
“You want to be able to reduce the amount of data you need to arrive at a solution,” Fisch said. One way to do that is to “share the information you’ve learned in a similar but different task to bootstrap your process for this new task,” he added. “If you can recognize the similarities between your new task and a task in the past, you can share that knowledge in order to not have to work as hard.”
Barzilay has used natural language processing to decipher the dead language Ugaritic to test whether a computer could find all the similarities between Ugaritic and a modern related language, Hebrew. Written in cuneiform, Ugaritic was last used in Syria circa 1200 B.C. It took linguists 12 years to decipher a Ugaritic tablet, but the computer needed only a few hours to correlate the letters the two languages have in common. However, Barzilay said ruefully, the available information about most undeciphered languages is still too scant to provide enough data for a computer to teach itself.
And there may be more pressing problems to solve. In 2014 Barzilay was diagnosed with breast cancer. In addition to the usual emotional trauma associated with a cancer diagnosis, Barzilay was appalled at the tiny amount of data physicians use to decide how to treat patients.
“I discovered that all the decisions today are based on the 3 percent of patients [who participate] in clinical trials,” she said. “I really wanted to know what happened to people like me, not based on six variables of comparison, but using all the wealth of information available” in medical records. Each time she went in for a treatment, she said, she asked, “Why are they doing this in such a primitive way? Not the treatment itself, but when I would ask questions they had no clue. We’re sitting on so much data.”
Barzilay assembled a group of researchers and clinicians to find out whether natural language processing could deepen and accelerate research that would improve patient outcomes. Most current research focuses on the biological and genetic mechanisms of cancer induction and progression, she found. “Obviously we should continue this research, but there are a lot of other things we should do. We are looking from one dimension, but there are different types of clues.”
These clues could come from patient data embedded in pathology reports, mammograms, biopsy slides, and even patient demographic information and general health status. Barzilay and colleagues collected 91,505 breast pathology reports from three local hospitals -- Massachusetts General, Brigham and Women's and Newton-Wellesley -- and hand-annotated a total of 17,136, developing 20 categories of information such as tumor characteristics and atypical cell types. Annotations are like the tags users can attach to computer files to enhance their searchability.
The researchers used their annotated dataset to train a computer to extract relevant data, then tested its skills on a separate set of 500 pathology reports. The program could parse, or recognize and pull out, the cancer types and tissue atypias with 90 percent accuracy for any given patient. This sort of information could give clinicians guidance in selecting treatment regimens. Their results were reported in the January 2017 issue of the journal Breast Cancer Research and Treatment. The researchers now have a parsed database of about 160,000 annotated reports.
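The train-then-test workflow described above can be sketched in a few lines of Python. This is an illustration only, not the researchers' system: it swaps in a simple naive Bayes word-count classifier, and the report snippets, labels, and function names are all invented for the example. The point is the shape of the procedure: learn from hand-annotated reports, then measure accuracy on reports the program has never seen.

```python
import math
from collections import Counter, defaultdict

def train(labeled_reports):
    """Count words per label from hand-annotated (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_reports:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Pick the most probable label (naive Bayes with add-one smoothing)."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, held_out):
    """Fraction of held-out reports whose predicted label matches the annotation."""
    correct = sum(predict(model, text) == label for text, label in held_out)
    return correct / len(held_out)

# Invented snippets standing in for annotated pathology reports.
annotated = [
    ("invasive ductal carcinoma present", "carcinoma"),
    ("ductal carcinoma in situ identified", "carcinoma"),
    ("atypical ductal hyperplasia noted", "atypia"),
    ("atypical lobular hyperplasia noted", "atypia"),
    ("benign breast tissue no atypia", "benign"),
    ("benign fibrous tissue only", "benign"),
]
# A separate set the model never saw during training.
held_out = [
    ("invasive carcinoma identified", "carcinoma"),
    ("atypical hyperplasia present", "atypia"),
    ("benign tissue", "benign"),
]
model = train(annotated)
print(f"held-out accuracy: {accuracy(model, held_out):.0%}")
```

Keeping the test reports separate from the training reports is what makes the accuracy figure meaningful: it estimates how the program would fare on new patients' records rather than on text it has already memorized.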
Kevin Hughes, a surgical oncologist at Massachusetts General Hospital and associate professor at Harvard Medical School who worked on the research with Barzilay, said, “Will it revolutionize pathology and oncology? I definitely think so.”
Although Hughes said he has not yet applied the database in clinical practice, he has “used the data to get a better sense of what the actual risk is” for atypical tissues and carcinoma variants. “Getting data out of medical records is very difficult,” he added. “We have electronic health records, which would make you think we have electronic data. What we have is electronic text documents.” Unless these can be transformed into something a computer can parse, they can’t be used to overcome the bottleneck in understanding represented by that 3 percent of patients in clinical trials. Major obstacles to progress include the difficulty of meshing databases from different institutions. Another issue is adhering to the requirements of HIPAA and other privacy laws that govern the ability of researchers to use patients’ personal information, Hughes said. The conflict between scientists’ thirst for data and patients’ right to privacy remains unresolved.
Hughes has found Barzilay to be a good colleague and said she is an excellent mentor for her students. Barzilay co-teaches the popular MIT course Introduction to Machine Learning and currently supervises 11 graduate students. Her own mentoring style, Barzilay said, is based on her early teaching experience and her own way of learning.
“I personally believe every person is born with a certain set of talents,” Barzilay said. “You cannot make a tiger look like a crocodile. My job is to see where students’ talents lie. Within the scope of existing constraints, I’m trying to find problems where they can do research which will be playing on their strengths and their intellect. It took me a long time to understand that the only way I can successfully do research [is to] look at the problems I really care about, which would not necessarily be the same ones other people care about.”
Even her cancer diagnosis has changed the way she teaches. “Being sick really opened my eyes because I realized everybody’s life is short,” she said. “It’s better that you use this life for something you really care about.”
Her 11-year-old son, Tomer, thinks she deserves that MacArthur grant. “What’s surprising about her is that she can do her work and be an amazing person and genius,” he said, “but she also has a lot of time for me. Every Tuesday we get pizza and bubble tea. I’m proud of her. I really want her to succeed and I think she’s going down the path to success.” Tomer is undecided on a career, but -- unsurprisingly, perhaps -- is considering becoming either a computer programmer or a surgeon. Perhaps one day he can be both.