DPrep converts a data set stored in a comma delimited text file to the binary format required by ORCA. DPrep will scale continuous features to the range [0,1] or normalize them by subtracting the mean and dividing by the standard deviation. DPrep will also randomize the order of the data set with a disk-based shuffling algorithm.
DPrep is called as follows:
dprep data-file fields-file output-file [options]
The data-file is the name of a comma delimited text file storing the data examples. The fields-file is the name of a file describing the attributes, including which fields to use. The output-file is the name of the binary file that DPrep writes.
DPrep goes through four stages as follows:
> dprep adult.data adult.fields adult.bin
The data file stores the data in a comma delimited format, with one example per line. For example, the first several records of the adult data set should appear as
39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
Missing values for continuous and discrete fields should be represented with a question mark (?). For example, in the record below the second (workclass) and seventh (occupation) fields are missing their values.
The fields file lists the attributes in the data set and describes their allowable values. For example, the fields file for the adult data set should appear as follows:
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: ignore.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: ignore.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: discrete.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: discrete.
salary: >50K, <=50K.
There is one attribute per line, and each attribute has a name followed by a description of its allowable values. An attribute can be defined in one of the four ways listed in Table 1.
|continuous||The attribute takes real values.|
|discrete||The attribute takes on a discrete number of categorical values, and DPrep should compile these from the data.|
|ignore||The attribute should be ignored and DPrep will not include it in the binary output file.|
|a comma separated list of names||The attribute takes on a categorical value from the list members.|
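The four attribute definitions in Table 1 can be illustrated with a small parser. This is a minimal sketch of how such a fields file might be read, not DPrep's actual implementation; the function name `parse_fields` and the `"list"` label for enumerated attributes are assumptions made for this example. It assumes one attribute per line and a trailing period on each line, as in the adult example.

```python
def parse_fields(text):
    """Parse a fields description into (name, kind, values) tuples.

    kind is "continuous", "discrete", "ignore", or "list" (hypothetical label
    used here for an explicit comma separated list of names).
    """
    attributes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The attribute name is everything before the first colon.
        name, _, description = line.partition(":")
        # Drop the trailing period that ends each attribute line.
        description = description.strip().rstrip(".").strip()
        if description in ("continuous", "discrete", "ignore"):
            attributes.append((name.strip(), description, None))
        else:
            values = [v.strip() for v in description.split(",")]
            attributes.append((name.strip(), "list", values))
    return attributes


sample = """age: continuous.
fnlwgt: ignore.
sex: discrete.
salary: >50K, <=50K.
"""

for attribute in parse_fields(sample):
    print(attribute)
```

Attributes marked ignore are kept in the parsed list here so that column positions in the data file still line up; a real converter would skip them when writing output.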
Table 2 summarizes the options available for DPrep. This is followed by a more detailed description of the individual options.
|-snone||no scaling of continuous fields|
|-s01||scale continuous fields to range [0,1]|
|-sstd||scale continuous fields to zero mean and unit standard deviation|
|Disk Based Randomization Options|
|-norand||do not randomize|
|-i X||execute X iterations of shuffling (5)|
|-rf X||use X temporary files for disk shuffling (10)|
|-seed X||random number seed X (time based)|
|-m X||floating-point number for encoding real missing values|
|-cleanf||clean temporary files at end|
|-cleand||clean temporary files during execution|
|-cleann||do not clean temporary files|
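When DPrep is driven from a script, the options in the table can be assembled into an argument list. The sketch below only builds the command; it does not run DPrep. The helper name `build_dprep_command` is hypothetical, and the sketch assumes that options follow the three file arguments (as in the usage line) and that each option value is passed as a separate token (as the table's "-i X" notation suggests).

```python
def build_dprep_command(data_file, fields_file, output_file,
                        scaling="-s01", iterations=None,
                        temp_files=None, seed=None):
    """Build an argument list for a dprep call (hypothetical helper)."""
    cmd = ["dprep", data_file, fields_file, output_file]
    if scaling is not None:
        cmd.append(scaling)                 # -snone, -s01, or -sstd
    if iterations is not None:
        cmd += ["-i", str(iterations)]      # number of shuffling iterations
    if temp_files is not None:
        cmd += ["-rf", str(temp_files)]     # number of temporary files
    if seed is not None:
        cmd += ["-seed", str(seed)]         # random number seed
    return cmd


command = build_dprep_command("adult.data", "adult.fields", "adult.bin", seed=42)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` on a machine where the `dprep` executable is on the path.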
-snone no scaling
This option turns off scaling of continuous attributes.
-s01 scale to range [0,1]
This option tells DPrep to scale all continuous features to the range [0,1]. For each continuous feature, DPrep finds the minimum and maximum values. It then scales each feature by subtracting the minimum value and dividing by the range (maximum-minimum).
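The scaling described above can be sketched in a few lines. This is an illustration of the formula only, not DPrep's code; the name `scale_01` is hypothetical, and mapping a constant feature (zero range) to 0 is an assumption, since the text does not say how that case is handled.

```python
def scale_01(values):
    """Scale a list of numbers to [0, 1]: (value - minimum) / (maximum - minimum)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant feature has no range, so map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Ages from the first five records of the adult data set.
ages = [39, 50, 38, 53, 28]
print(scale_01(ages))
```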
-sstd scale by mean and standard deviation
This option tells DPrep to put all continuous variables into standard form by subtracting the mean from each feature and dividing by the standard deviation.
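As a sketch of this standard-form transformation, the snippet below subtracts the mean and divides by the standard deviation. The name `standardize` is hypothetical. The text does not say whether DPrep uses the population or the sample standard deviation; the population form is assumed here, and a constant feature is mapped to 0 as a further assumption.

```python
import statistics


def standardize(values):
    """Return (value - mean) / standard deviation for each value."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # Assumption: a constant feature carries no spread, so map it to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hours-per-week values from the first five records of the adult data set.
hours = [40, 13, 40, 40, 40]
print(standardize(hours))
```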
-norand no randomization
Turns off randomization. That is, DPrep will not randomize the order of examples in data-file and will preserve the ordering in the binary output file. This option should only be used when the data file is already randomized or has no ordering dependencies (such as when artificial data is generated from a known probability distribution).
-i X number of shufflings
This option sets the number of iterations DPrep uses to randomize the ordering of examples. In each iteration, DPrep assigns each example to a randomly chosen temporary file and then concatenates the temporary files in random order.
-rf X number of temporary files
This option sets the number of temporary files to be used during randomization.
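The shuffling procedure controlled by -i and -rf can be sketched as follows. This is a simplified illustration, not DPrep's implementation: it works on lines of a text file rather than binary records, the function name `disk_shuffle` and the temporary file names are hypothetical, and the defaults of 5 iterations and 10 temporary files are taken from the options table.

```python
import os
import random
import tempfile


def disk_shuffle(in_path, out_path, iterations=5, num_files=10, seed=None):
    """Shuffle the lines of in_path into out_path using temporary files."""
    rng = random.Random(seed)
    current = in_path
    with tempfile.TemporaryDirectory() as tmp_dir:
        for it in range(iterations):
            paths = [os.path.join(tmp_dir, f"bucket_{it}_{k}.tmp")
                     for k in range(num_files)]
            buckets = [open(p, "w") for p in paths]
            try:
                # Assign each example to a randomly chosen temporary file.
                with open(current) as src:
                    for line in src:
                        if not line.endswith("\n"):
                            line += "\n"
                        rng.choice(buckets).write(line)
            finally:
                for bucket in buckets:
                    bucket.close()
            # Concatenate the temporary files in random order.
            rng.shuffle(paths)
            merged = os.path.join(tmp_dir, f"merged_{it}.tmp")
            with open(merged, "w") as dst:
                for p in paths:
                    with open(p) as part:
                        for line in part:
                            dst.write(line)
                    os.remove(p)
            # Remove the previous pass's output, but never the original input.
            if current != in_path:
                os.remove(current)
            current = merged
        # Copy the final pass to the requested output path.
        with open(current) as src, open(out_path, "w") as dst:
            for line in src:
                dst.write(line)
```

Only one example is held in memory at a time, which is the point of shuffling on disk. A single pass leaves examples within each temporary file in their original relative order, which is why several iterations are used.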
This software is Copyright 2003 by the Institute for the Study of Learning and Expertise. DPrep may be freely used for educational and research purposes by non-profit institutions and U.S. Government agencies. Other organizations may use ORCA for evaluation purposes only. All further uses require prior approval.
This software is provided "as is" with no warranties of any kind, either expressed or implied, including, but not limited to implied warranties as to the performance, fitness, and merchantability of the software for a particular purpose.
The entire risk of using the software is with the user. The software is provided without any support or obligation to assist with its use. This software may not be sold or redistributed without prior approval.