With all the hoopla surrounding the rollout of Obamacare, one federal healthcare mandate tucked into the American Recovery and Reinvestment Act of 2009 has evolved with relatively little public fanfare.

Yanjun Li’s algorithms work through pure text to create data clusters.
Photo by Tom Stoelker

Nevertheless, the mandate, which requires healthcare providers to digitize all their patients’ medical records by 2015, will have a huge impact on patients and on the field of informatics.

Massive quantities of data coming online will need to be organized. That will require an algorithmic solution, and it is just the sort of problem that Yanjun Li, Ph.D., revels in solving.

Li, an associate professor in the computer and information sciences department, is a data miner. Hers is a science that plays an important role in society beyond spying and marketing.

Her specialization is in text mining, with a focus on the English language. She is working on algorithms that take pure text documents and cluster them. For example, by sorting through nearly 1,000 newspaper articles that don’t have obvious hints as to their content in a headline or byline, an algorithm can determine whether the articles should be grouped in several clusters and filed under headings such as sports, travel, or business. If an article falls under sports, it can then be sorted further into baseball, football, or tennis, for example.
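
To make the idea concrete, here is a minimal sketch of grouping pure text by word overlap. It is not Li's algorithm: the stop-word list, the similarity threshold, the "first similar cluster wins" strategy, and the four sample sentences are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Assumed stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "as", "to", "of", "in"}

def bag_of_words(text):
    """Count the non-stop-words in a text (a deliberately crude representation)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs, threshold=0.2):
    """Put each document in its most similar cluster, or start a new one."""
    clusters = []  # each entry: {"centroid": Counter, "members": [doc indices]}
    for i, doc in enumerate(docs):
        vec = bag_of_words(doc)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)  # fold the new words into the cluster
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]

# Hypothetical one-sentence "articles" with no headline or byline.
articles = [
    "The pitcher threw a fastball and the batter hit a home run",
    "Stocks fell as the market reacted to weak bank earnings",
    "The batter struck out and the pitcher won the game",
    "The bank reported earnings and the market pushed stocks higher",
]
groups = cluster(articles)
print(groups)
```

Running the same procedure again on the members of one group is one simple way to split a broad cluster such as sports into narrower ones.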

Sorting through sports articles may seem rather straightforward, but where some may find similarities, others find distinctions. Finding similarity for grouping and categorizing, Li said, often boils down to the meaning of a word.

This can get complicated when several words can mean the same thing.

Li said that while “vehicle” and “automobile” can mean the same thing, “car” and “truck” cannot. Yet certain algorithms treat both a car and a truck as a kind of vehicle and a kind of automobile.

“We call this ontology, also known as a ‘words network’ in our text-mining algorithm,” said Li. “If a newspaper article mentions a sedan and another mentions a truck, they use different words but they are all in the same semantic tree, on this same branch, so we can group them together.”
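
A words network of this kind can be pictured as a tree in which each term points to a broader one. The sketch below is a toy illustration only: the `PARENT` table and the helper names `branch` and `common_concept` are hypothetical and are not taken from Li's work.

```python
# Hypothetical, hand-built words network: each term maps to its broader term.
PARENT = {
    "sedan": "car",
    "car": "automobile",
    "truck": "automobile",
    "automobile": "vehicle",
    "tennis": "sport",
    "baseball": "sport",
}

def branch(word):
    """Return the path from a word up to the root of its tree."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def common_concept(a, b):
    """Return the most specific concept two words share, or None."""
    ancestors_of_b = set(branch(b))
    for concept in branch(a):  # walks from most specific to most general
        if concept in ancestors_of_b:
            return concept
    return None

print(branch("sedan"))                    # path from "sedan" to the root
print(common_concept("sedan", "truck"))   # shared branch, so groupable
print(common_concept("sedan", "tennis"))  # different trees
```

Replacing each word with a shared broader concept before comparing documents is one way two articles that use different words can still end up in the same group.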

Ontologies have been adopted to improve the performance of text-mining algorithms. Conversely, text-mining techniques could be used to build ontologies for specific domains, such as medical documents. Once records are digitized, patients hoping to get a second or third opinion will no longer have to get a fresh set of tests from each doctor. Instead, multiple doctors can review the same electronic medical records.

If a doctor wants to dig deeper into a patient’s history, that’s where the categorization will become very important: What medicines has the patient taken? For how long? What are the side effects of those medications?

Li said that industries besides health care also require clustering and categorization. Any merger or acquisition melds more than just corporate cultures: Companies must also merge their documents, sometimes millions of them.

She said that an algorithm is often used as a preparatory step, a way to cluster documents into initial groups before locking them into their final home.

“The best way to start is to jump in and say, ‘This should be here and this should be here,’” said Li. “Maybe 80 percent of the documents end up where they belong. It’s a helpful first step.”

While the general impression is that computer algorithms do all the work, Li said that domain experts also play an important role in the clustering and classification of text documents.
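
The division of labor described here, in which an algorithm proposes initial groups and people handle what it cannot place, might look roughly like the sketch below. The category names, keyword lists, `min_hits` cutoff, and sample documents are all hypothetical, and keyword matching stands in for whatever method a real system would use.

```python
import re

# Hypothetical keyword lists; a real system would learn these from data.
CATEGORIES = {
    "contracts": {"agreement", "party", "terms", "signed"},
    "invoices": {"invoice", "payment", "amount", "due"},
}

def propose(doc, min_hits=2):
    """Suggest an initial category, or flag the document for a domain expert."""
    words = set(re.findall(r"[a-z]+", doc.lower()))
    scores = {name: len(words & keywords) for name, keywords in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    # Too little evidence means a person, not the algorithm, should decide.
    return best if scores[best] >= min_hits else "needs_review"

docs = [
    "Invoice 17: payment of the full amount is due in 30 days",
    "This agreement is signed by each party under the following terms",
    "Minutes of the quarterly staff meeting",
]
placements = [propose(d) for d in docs]
print(placements)
```

Documents the sketch cannot place with enough evidence go to a review queue rather than being forced into a group, which leaves the final decision to an expert.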

Li chose to teach computer and information science rather than practice it in the lucrative private sector because she sees teaching as her natural calling. While studying with classmates in China and later in the United States, Li heard repeatedly from her peers that she had a knack for explaining complex problems.

Though she understands quite a bit more than she did when she was an undergraduate, Li said that she still tries to remember what it was like to start from scratch.

“Sometimes it’s hard to explain on the same level as your students,” she said. “But I try to put myself in their shoes and remember how I learned. I tell my students, ‘I’m not teaching you, I’m sharing my learning experience with you.’”