Using DPrep

DPrep converts a data set stored in a comma delimited text file to the binary format required by ORCA. DPrep will scale continuous features to the range [0,1] or normalize them by subtracting the mean and dividing by the standard deviation. DPrep will also randomize the order of the data set with a disk-based shuffling algorithm.

DPrep is called as follows:

dprep data-file fields-file output-file [options]

The data-file is the name of a comma delimited text file storing the data examples. The fields-file is the name of a file describing the attributes, including which fields to use. The output-file is the name of the binary file that DPrep writes.

DPrep goes through four stages as follows:

writes a weight file for use with ORCA,
converts the data set to binary,
scales the data set, and
randomizes the data.

To run DPrep on the sample adult database, type:

> dprep adult.data adult.fields adult.bin

Data File

The data file stores the data in a comma delimited format, with one example per line. For example, the first several records of the adult data set appear as follows:

39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K

Missing values for continuous and discrete fields should be represented with a question mark (?). For example, in the record below the second (workclass) and seventh (occupation) fields are missing their values.

54,?,180211,Some-college,10,Married-civ-spouse,?,Husband,Asian-Pac-Islander,Male,0,0,60,South,>50K
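As a rough illustration of the missing-value convention, the sketch below reads one record and reports which fields hold a question mark. It is not part of DPrep; the function name parse_record and the use of None for a missing value are assumptions made for this example only.

```python
import csv
import io

MISSING = "?"  # marker the data file uses for a missing value


def parse_record(line):
    """Split one comma delimited record; missing fields become None.

    Hypothetical helper for illustration, not DPrep's own parser.
    """
    row = next(csv.reader(io.StringIO(line)))
    return [None if value.strip() == MISSING else value.strip() for value in row]


record = parse_record(
    "54,?,180211,Some-college,10,Married-civ-spouse,?,Husband,"
    "Asian-Pac-Islander,Male,0,0,60,South,>50K"
)
# 1-based positions of the missing fields (workclass and occupation)
missing_positions = [i + 1 for i, value in enumerate(record) if value is None]
print(missing_positions)  # prints [2, 7]
```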

Fields File

The fields file contains a listing of the attributes in the data set and a description of the allowable values. For example, the fields file for the adult data set should appear as follows:

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: ignore.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: ignore.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: discrete.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: discrete.
salary: >50K, <=50K.

There is one attribute per line, and each attribute should have a name followed by a description of its allowable values. An attribute can be defined in one of the four ways listed in Table 1.

Table 1: List of valid attribute types.
continuous	The attribute takes real values.
discrete	The attribute takes on a discrete number of categorical values and DPrep should compile these from the data.
ignore	The attribute should be ignored and DPrep will not include it in the binary output file.
a comma separated list of names	The attribute takes on a categorical value from the list members.
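The four attribute types can be told apart with a few lines of code. The following is a minimal sketch that assumes each line has the form "name: description." as in the adult example above; the function name parse_field and the label "list" for an explicit value list are hypothetical, and DPrep's actual parser may handle cases this sketch does not.

```python
KEYWORDS = ("continuous", "discrete", "ignore")


def parse_field(line):
    """Parse one 'name: description.' line into (name, kind, values).

    kind is one of the three keywords from Table 1, or "list" (a label
    invented for this sketch) when the description enumerates the values.
    """
    name, _, description = line.strip().partition(":")
    description = description.strip()
    if description.endswith("."):
        description = description[:-1]  # drop the terminating period
    if description in KEYWORDS:
        return name.strip(), description, []
    values = [value.strip() for value in description.split(",")]
    return name.strip(), "list", values


print(parse_field("age: continuous."))
print(parse_field("fnlwgt: ignore."))
print(parse_field("salary: >50K, <=50K."))
```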

Options

Table 2 summarizes the options available for DPrep. This is followed by a more detailed description of the individual options.

Table 2: Summary of DPrep options.
Scaling Options
	-snone	no scaling of continuous fields
	-s01	scale continuous fields to range [0,1]
	-sstd	scale continuous fields to zero mean and unit standard deviation
Disk Based Randomization Options
	-rand	randomize
	-norand	do not randomize
	-i X	execute X iterations of shuffling (5)
	-rf X	use X temporary files for disk shuffling (10)
	-seed X	random number seed X (time based)
Miscellaneous Options
	-m X	floating point number for encoding real missing values
	-cleanf	clean temporary files at end
	-cleand	clean temporary files during execution
	-cleann	do not clean temporary files

-snone no scaling

This option turns off scaling of continuous attributes.

-s01 scale to range [0,1]

This option tells DPrep to scale all continuous features to the range [0,1]. For each continuous feature, DPrep finds the minimum and maximum values. It then scales each feature by subtracting the minimum value and dividing by the range (maximum-minimum).
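The arithmetic described above can be sketched as follows. This is an illustration only, not DPrep's code: the function name scale_01 is hypothetical, and mapping a constant feature (maximum equal to minimum) to 0 is an assumption, since the text does not say how DPrep treats that case.

```python
def scale_01(values):
    """Scale a list of numbers to [0, 1] by (value - minimum) / (maximum - minimum)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant feature is mapped to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - lo) / span for value in values]


# Ages from the five sample records of the adult data set.
ages = [39, 50, 38, 53, 28]
scaled_ages = scale_01(ages)
print(scaled_ages)  # the smallest age maps to 0.0 and the largest to 1.0
```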

-sstd scale by mean and standard deviation

This option tells DPrep to put all continuous variables into standard form by subtracting the mean from each feature and dividing by the standard deviation.
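A minimal sketch of this standardization is given below. The function name standardize is hypothetical. The text does not say whether DPrep uses the population or the sample standard deviation; the sketch assumes the population form, and it maps a constant feature to 0 as a further assumption.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # Assumption: a constant feature is mapped to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


# Hours-per-week values from the five sample records of the adult data set.
hours = [40, 13, 40, 40, 40]
z_scores = standardize(hours)
print(z_scores)  # the result has zero mean and unit standard deviation
```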

-norand no randomization

Turns off randomization. That is, DPrep will not randomize the order of examples in data-file and will preserve the ordering in the binary output file. This option should only be used when the data file is already randomized or has no ordering dependencies (such as when artificial data is generated from a known probability distribution).

-i X number of shufflings

This option sets the number of iterations DPrep uses to randomize the ordering of examples. In each iteration, DPrep assigns each example to a randomly chosen temporary file and then concatenates the files in random order.
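The shuffling scheme just described can be sketched on a text file with one example per line. This is a simplified illustration under stated assumptions, not DPrep's implementation: DPrep shuffles its binary file, whereas the sketch works on text lines and assumes every line ends with a newline. The names shuffle_pass and disk_shuffle are hypothetical, and the argument values 5 and 10 are the numbers shown in parentheses in Table 2 for -i and -rf.

```python
import os
import random
import tempfile


def shuffle_pass(path, n_files, rng):
    """One iteration: scatter lines over temporary files, then concatenate them."""
    tmp_dir = tempfile.mkdtemp()
    tmp_paths = [os.path.join(tmp_dir, f"part{i}.tmp") for i in range(n_files)]
    parts = [open(tmp_path, "w") for tmp_path in tmp_paths]
    try:
        with open(path) as src:
            for line in src:
                # Send each example to a randomly chosen temporary file.
                rng.choice(parts).write(line)
    finally:
        for part in parts:
            part.close()
    # Concatenate the temporary files back into the data file in random order.
    rng.shuffle(tmp_paths)
    with open(path, "w") as dst:
        for tmp_path in tmp_paths:
            with open(tmp_path) as part:
                for line in part:
                    dst.write(line)
            os.remove(tmp_path)
    os.rmdir(tmp_dir)


def disk_shuffle(path, iterations=5, n_files=10, seed=None):
    """Repeat shuffle_pass; only one line is held in memory at a time."""
    rng = random.Random(seed)
    for _ in range(iterations):
        shuffle_pass(path, n_files, rng)
```

A single pass keeps the relative order of the examples that land in the same temporary file, which suggests why several iterations are used: each further pass breaks up more of the original ordering.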

-rf X number of temporary files

This option sets the number of temporary files to be used during randomization.

Copyright and Usage

This software is Copyright 2003 by the Institute for the Study of Learning and Expertise. DPrep may be freely used for educational and research purposes by non-profit institutions and U.S. Government agencies. Other organizations may use ORCA for evaluation purposes only. All further uses require prior approval.

This software is provided "as is" with no warranties of any kind, either expressed or implied, including, but not limited to implied warranties as to the performance, fitness, and merchantability of the software for a particular purpose.

The entire risk of using the software is with the user. The software is provided without any support or obligation to assist with its use. This software may not be sold or redistributed without prior approval.

2003-5-6