MIT researchers have created a new system that automatically cleans “dirty data” - the typos, duplicates, missing values, misspellings, and inconsistencies dreaded by data analysts, data engineers, and data scientists.

The system, called PClean, is the latest in a series of domain-specific probabilistic programming languages written by researchers at the Probabilistic Computing Project that aim to simplify and automate the development of AI applications (others include one for 3D perception via inverse graphics and another for modeling time series and databases).

According to surveys conducted by Anaconda and Figure Eight, data cleaning can take a quarter of a data scientist's time. Automating the task is challenging because different datasets require different types of cleaning, and common-sense judgment calls about objects in the world are often needed (e.g., which of several cities called “Beverly Hills” someone lives in). PClean provides generic common-sense models for these kinds of judgment calls that can be customized to specific databases and types of errors.

PClean uses a knowledge-based approach to automate the data-cleaning process: users encode background knowledge about the database and what sorts of issues might appear. Take, for instance, the problem of cleaning state names in a database of apartment listings. What if someone said they lived in Beverly Hills but left the state column empty? Though there is a well-known Beverly Hills in California, there’s also one in Florida, Missouri, and Texas, and there’s a neighborhood of Baltimore known as Beverly Hills. How can you know in which one the person lives? This is where PClean’s expressive scripting language comes in. Users can give PClean background knowledge about the domain and about how data might be corrupted, and PClean combines this knowledge via common-sense probabilistic reasoning to come up with the answer. For example, given additional knowledge about typical rents, PClean infers that the correct Beverly Hills is in California because of the high cost of rent where the respondent lives.

Alex Lew, the lead author of the paper and a PhD student in the Department of Electrical Engineering and Computer Science (EECS), says he’s most excited that PClean gives a way to enlist help from computers in the same way that people seek help from one another. “When I ask a friend for help with something, it's often easier than asking a computer. That's because in today's dominant programming languages, I have to give step-by-step instructions, which can't assume that the computer has any context about the world or task - or even just common-sense reasoning abilities. With a human, I get to assume all those things,” he says. “PClean is a step toward closing that gap. It lets me tell the computer what I know about a problem, encoding the same kind of background knowledge I'd explain to a person helping me clean my data. I can also give PClean hints, tips, and tricks I've already discovered for solving the task faster.”

Co-authors are Monica Agrawal, a PhD student in EECS; David Sontag, an associate professor in EECS; and Vikash K. Mansinghka, a principal research scientist in the Department of Brain and Cognitive Sciences.

“Ensuring data quality is a huge problem in the real world, and almost all existing solutions are ad hoc, expensive, and error-prone,” says Stuart Russell, professor of computer science at UC Berkeley. “PClean is the first scalable, well-engineered, general-purpose solution based on generative data modeling, which has to be the right way to go.” The idea that probabilistic cleaning based on declarative, generative knowledge could potentially deliver much greater accuracy than machine learning was previously suggested in a 2003 paper by Hanna Pasula and others from Russell’s lab at the University of California at Berkeley.
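To make the Beverly Hills example concrete, here is a minimal sketch of the kind of common-sense probabilistic reasoning described above: score each candidate state by how plausible the observed rent is under that state's typical rents, then normalize. This is plain Bayes-rule Python for illustration only - the rent figures are invented, and it does not use PClean's actual modeling language or inference engine.

```python
import math

# Hypothetical "background knowledge": typical monthly rent
# (mean, std dev) for each place named Beverly Hills.
# These numbers are illustrative, not real data.
CANDIDATES = {
    "CA": (4000, 1200),  # Beverly Hills, California
    "FL": (1500, 400),   # Beverly Hills, Florida
    "MO": (1000, 300),   # Beverly Hills, Missouri
    "TX": (1400, 400),   # Beverly Hills, Texas
    "MD": (1300, 350),   # the Baltimore neighborhood
}

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior_state(observed_rent, prior=None):
    """Bayes' rule over candidate states, given the rent in the listing."""
    states = list(CANDIDATES)
    prior = prior or {s: 1.0 / len(states) for s in states}
    unnorm = {s: prior[s] * normal_pdf(observed_rent, *CANDIDATES[s]) for s in states}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

# A listing with high rent and a blank state column:
post = posterior_state(3800)
print(max(post, key=post.get))  # prints "CA": high rent makes California most probable
```

A real PClean model would express the same idea declaratively - latent city and state, plus a model of how fields get corrupted or left blank - and let the runtime carry out the inference, rather than hand-coding the arithmetic as above.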