Data wrangling

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.

These downstream purposes may include further munging, data visualization, data aggregation, and training a statistical model, among many other potential uses. As a process, data munging typically follows a set of general steps: extracting the data in raw form from the data source, "munging" the raw data using algorithms (e.g. sorting) or parsing it into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.[1]
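The general steps described above (extract, munge, deposit) can be sketched in a few lines of Python. This is only an illustrative sketch: the sample data, the function names extract, munge and deposit, and the choice of an in-memory JSON buffer as the data sink are all assumptions made for the example, not part of any particular tool.

```python
import csv
import io
import json

# Hypothetical raw input standing in for a data source.
RAW = "name,score\nbob,7\nalice,12\ncarol,9\n"

def extract(source: str) -> list[dict]:
    """Read raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(source)))

def munge(rows: list[dict]) -> list[dict]:
    """Parse fields into typed values, then sort rows by score, descending."""
    parsed = [{"name": r["name"], "score": int(r["score"])} for r in rows]
    return sorted(parsed, key=lambda r: r["score"], reverse=True)

def deposit(rows: list[dict], sink: io.StringIO) -> None:
    """Write the munged rows to a sink as JSON for storage and future use."""
    json.dump(rows, sink)

# Assumed sink: an in-memory buffer; a file or database would serve equally.
sink = io.StringIO()
deposit(munge(extract(RAW)), sink)
result = json.loads(sink.getvalue())
```

Keeping the three stages as separate functions mirrors the description above and makes it possible to swap the source or the sink without touching the munging logic.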

Background

The non-technical term "wrangler" is often said to derive from work done by the United States Library of Congress's National Digital Information Infrastructure and Preservation Program (NDIIPP) and its program partner, the Emory University Libraries-based MetaArchive Partnership. The term "mung" has roots in munging as described in the Jargon File.[2] The term "data wrangler" was also suggested as the best analogy to "coder" for someone working with data.[3]

The terms data wrangling and data wrangler had sporadic use in the 1990s and early 2000s. One of the earliest business mentions of data wrangling was in an article in Byte Magazine in 1997 (Volume 22 issue 4) referencing “Perl’s data wrangling services”. In 2001 it was reported that CNN hired[4] “a dozen data wranglers” to help track down information for news stories.

One of the first mentions of data wrangling in a scientific context was by Donald Cline during the NASA/NOAA Cold Lands Processes Experiment.[5] Cline stated that the data wranglers “coordinate the acquisition of the entire collection of the experiment data.” Cline also specified duties typically handled by a storage administrator working with large amounts of data. Such work can occur in areas like major research projects and the making of films with a large amount of complex computer-generated imagery. In research, it involves both data transfer from research instruments to a storage grid or storage facility and data manipulation for re-analysis via high-performance computing instruments or access via cyberinfrastructure-based digital libraries.

Typical use

The data transformations are typically applied to distinct entities within a data set (e.g. fields, rows, columns or data values), and can include actions such as extracting, parsing, joining, standardizing, augmenting, cleansing, consolidating and filtering to create the desired wrangling outputs for downstream use.
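Several of the transformations named above can be illustrated on a small data set. The following is a minimal sketch using only plain Python; the records, field names and the COUNTRY_CODES lookup table are hypothetical and chosen purely for illustration.

```python
# Hypothetical records; all names and fields are illustrative assumptions.
customers = [
    {"id": "1", "name": "  Alice SMITH ", "country": "us"},
    {"id": "2", "name": "bob jones", "country": "U.S."},
    {"id": "3", "name": "", "country": "de"},
]
orders = [
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 2, "total": 12.5},
    {"customer_id": 1, "total": 7.5},
]

# Assumed lookup table mapping spelling variants to one standard code.
COUNTRY_CODES = {"us": "US", "u.s.": "US", "de": "DE"}

def standardize(row: dict) -> dict:
    """Parse the id, trim and title-case the name, normalize the country."""
    return {
        "id": int(row["id"]),
        "name": row["name"].strip().title(),
        "country": COUNTRY_CODES.get(row["country"].lower(), "UNKNOWN"),
    }

# Cleansing and filtering: standardize rows, then drop those with no name.
clean = [c for c in map(standardize, customers) if c["name"]]

# Consolidating: sum the order totals per customer.
totals: dict[int, float] = {}
for order in orders:
    key = order["customer_id"]
    totals[key] = totals.get(key, 0.0) + order["total"]

# Joining and augmenting: attach each customer's order total to their row.
wrangled = [{**c, "order_total": totals.get(c["id"], 0.0)} for c in clean]
```

Here the customer with an empty name is removed during cleansing, the two spellings of the same country collapse to a single code, and each remaining row gains an order_total field drawn from the second data set.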

The recipients of these outputs could be individuals, such as data architects or data scientists who will investigate the data further; business users who will consume the data directly in reports; or systems that will further process the data and write it into targets such as data warehouses, data lakes or downstream applications.

Modus operandi

Depending on the amount and format of the incoming data, data wrangling has traditionally been performed manually (e.g. via spreadsheets such as Excel) or via scripts in languages such as Python or SQL. R, a language often used in data mining and statistical data analysis, is now also often[6] used for data wrangling.

Visual data wrangling systems were developed to make data wrangling accessible to non-programmers and simpler for programmers. Some of these also include embedded AI recommenders and programming-by-example facilities to provide user assistance, and program synthesis techniques to autogenerate scalable dataflow code. Early prototypes of visual data wrangling tools include OpenRefine and the Stanford/Berkeley Wrangler research system[7]; the latter evolved into Trifacta.

Other terms for these processes have included data franchising[8], data preparation and data munging.

References

  1. What Is Data Munging?
  2. Jargon File entry for Mung
  3. Open Knowledge Foundation Blog Post
  4. Behind the Headlines at Revamped News
  5. Parsons, M. A.; Brodzik, M. J.; Rutter, N. J. (2004). "Data management for the Cold Land Processes Experiment: improving hydrological science". Hydrological Processes. 18: 3637-3653. http://onlinelibrary.wiley.com/doi/10.1002/hyp.5801/abstract
  6. O’Reilly 2016 Data Science Survey
  7. Kandel, Sean; Paepcke, Andreas (May 2011). "Wrangler: Interactive Visual Specification of Data Transformation Scripts". SIGCHI. doi:10.1145/1978942.1979444.
  8. What is Data Franchising? (2003 and 2017 IRI)
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.