orca test-file reference-file weight-file [options]
Orca examines each example in the test-file and determines if it is an outlier by comparing it with the examples in the reference-file. The weight-file contains a list of the features and their weights. The test-file and reference-file can be the same, in which case Orca compares each example to all others except itself. The format of the files is described below.
#examples #real #discrete case1 case2 ... caseN
#examples is an integer giving the total number of examples in the binary file. #real and #discrete are integers giving the number of real and discrete variables in each case. Each case is stored as follows:
ID R1 R2 ... Rm D1 D2 ... Dp
ID is a unique identifying integer for each case. Orca uses this ID number to reference outlier examples in its output. The ID is followed by the real variables and then the discrete variables.
There are no delimiters between fields; Orca keeps track of the element it is accessing by its offset from the beginning of the file. The ID and the discrete variables are always encoded as 4-byte integers, and the real variables as 4-byte floating-point numbers.
In the binary data files, missing values are represented by a specific floating point or integer value. By default, continuous values are encoded as -989898 and categorical values are encoded as -1.
These data files can be generated automatically with DPrep, which converts comma-delimited data sets stored in text format.
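The layout described above can be sketched in Python with the standard `struct` module. This is a minimal sketch, not DPrep: the function names `write_orca_file` and `read_orca_file` are hypothetical, and it assumes little-endian byte order and that the three header counts are also stored as 4-byte integers, neither of which is stated here.

```python
import struct

MISSING_REAL = -989898.0   # default marker for a missing continuous value
MISSING_DISCRETE = -1      # default marker for a missing categorical value


def write_orca_file(path, cases, n_real, n_discrete):
    """Write cases given as (id, reals, discretes); None marks a missing value."""
    case_fmt = f"<i{n_real}f{n_discrete}i"  # assumption: little-endian
    with open(path, "wb") as f:
        # Header: #examples, #real, #discrete (assumed to be 4-byte integers).
        f.write(struct.pack("<3i", len(cases), n_real, n_discrete))
        for case_id, reals, discretes in cases:
            reals = [MISSING_REAL if r is None else r for r in reals]
            discretes = [MISSING_DISCRETE if d is None else d for d in discretes]
            # ID, then the real variables, then the discrete variables.
            f.write(struct.pack(case_fmt, case_id, *reals, *discretes))


def read_orca_file(path):
    """Read the file back into a list of (id, reals, discretes) tuples."""
    with open(path, "rb") as f:
        n_examples, n_real, n_discrete = struct.unpack("<3i", f.read(12))
        case_struct = struct.Struct(f"<i{n_real}f{n_discrete}i")
        cases = []
        for _ in range(n_examples):
            # No delimiters: each case is located purely by its byte offset.
            fields = case_struct.unpack(f.read(case_struct.size))
            reals = [None if r == MISSING_REAL else r
                     for r in fields[1:1 + n_real]]
            discretes = [None if d == MISSING_DISCRETE else d
                         for d in fields[1 + n_real:]]
            cases.append((fields[0], reals, discretes))
    return cases
```

With two real variables and one discrete variable, each case occupies 16 bytes under these assumptions, so a file with two cases is 12 + 2 × 16 = 44 bytes long.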
age 1.0
capital-gain 1.0
capital-loss 1.0
hours-per-week 1.0
workclass 0.4
education 0.4
marital-status 0.4
occupation 0.4
relationship 0.4
race 0.4
native-country 0.4
salary 0.4
sex 0.4
The features are listed one per line, and each line contains a text string representing the name and a number representing the weight.
The weights are used in the distance measure used to compare examples. For continuous features the distance measure is weighted Euclidean distance. For discrete features, the distance measure is weighted Hamming distance.
DPrep automatically generates a weight file when converting a data set into the proper binary format.
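As an illustration, a weight file can be parsed and applied in a few lines of Python. This is a sketch under stated assumptions, not Orca's implementation: the names `parse_weights` and `weighted_distance` are hypothetical, the weight is applied to the squared difference inside the square root, and the continuous and discrete parts are simply added. The text above does not specify either of those two details.

```python
import math


def parse_weights(text):
    """Parse 'name weight' lines into an ordered list of (name, weight) pairs."""
    weights = []
    for line in text.splitlines():
        if line.strip():
            name, weight = line.split()
            weights.append((name, float(weight)))
    return weights


def weighted_distance(x_real, y_real, w_real, x_disc, y_disc, w_disc):
    """Distance between two examples split into real and discrete parts."""
    # Continuous features: weighted Euclidean distance.
    euclid = math.sqrt(sum(w * (a - b) ** 2
                           for a, b, w in zip(x_real, y_real, w_real)))
    # Discrete features: weighted Hamming distance, i.e. the sum of the
    # weights of the fields whose values differ.
    hamming = sum(w for a, b, w in zip(x_disc, y_disc, w_disc) if a != b)
    # Assumption: the two parts are added to give the overall distance.
    return euclid + hamming
```

For example, two cases whose real variables differ by (3, 4) with unit weights and that disagree on one discrete field of weight 0.4 are at distance 5 + 0.4 = 5.4 under this reading.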
Running Orca on the adult data set with the command
> orca adult.bin adult.bin adult.weights
starts the search for outliers. Orca will produce a variety of log information, including its progress in processing the files, and a list of the outliers found which looks like
Top outliers:
1. Record: 12094 Score: 8.92967
2. Record: 17539 Score: 8.71359
3. Record: 1827 Score: 8.33826
4. Record: 39783 Score: 8.04264
5. Record: 16261 Score: 8.02936
6. Record: 4657 Score: 8.02299
7. Record: 42601 Score: 7.75865
The outliers are listed in descending order by their score. The first number is the rank of the outlier. The record number is the ID of the example. The score is the example's outlier score; a higher score indicates a more unusual example.
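If the outlier list needs to be processed further, it can be pulled out of the log with a regular expression. The sketch below assumes the log entries have exactly the "rank. Record: ID Score: value" shape shown above; the function name `parse_top_outliers` is hypothetical and not part of Orca.

```python
import re

# Matches entries such as "3. Record: 1827 Score: 8.33826".
OUTLIER_RE = re.compile(r"(\d+)\.\s+Record:\s+(\d+)\s+Score:\s+([0-9.]+)")


def parse_top_outliers(log_text):
    """Return (rank, record_id, score) tuples found in the log text."""
    return [(int(rank), int(record), float(score))
            for rank, record, score in OUTLIER_RE.findall(log_text)]
```

Because the pattern only relies on whitespace between tokens, it works whether the entries appear one per line or run together on a single line.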
If the -rn option is used, Orca keeps track of the nearest neighbors for each outlier and then computes the contribution of each feature to the outlier score. The output looks like
Top outliers:
1. Record: 12094 Score: 8.92967
   Neighbors: 19439 39246 46432 18924 22226
   feature importance:
     capital-gain: 3.74
     native-country: 1.60
     race: 0.80
     salary: 0.80
     relationship: 0.80
     education: 0.40
     marital-status: 0.40
     age: 0.35
     hours-per-week: 0.03
     workclass: 0.00
     sex: 0.00
     occupation: 0.00
     capital-loss: 0.00
2. Record: 17539 Score: 8.71359
   Neighbors: 8618 47830 37344 32371 45612
   feature importance:
     capital-gain: 2.52
     native-country: 2.00
     workclass: 1.20
     sex: 1.20
     race: 0.80
     relationship: 0.40
     marital-status: 0.40
     age: 0.14
     hours-per-week: 0.05
     salary: 0.00
     occupation: 0.00
     education: 0.00
     capital-loss: 0.00
3. Record: 1827 Score: 8.33826
   Neighbors: 38025 12534 18127 12910 20614
   feature importance:
     occupation: 2.00
     native-country: 2.00
     marital-status: 1.60
     relationship: 1.60
     age: 0.68
     sex: 0.40
     hours-per-week: 0.06
     workclass: 0.00
     salary: 0.00
     race: 0.00
     education: 0.00
     capital-loss: 0.00
     capital-gain: 0.00
As before, the outliers are listed in descending order according to their score. The Neighbors entry lists the IDs of the neighboring cases. For example, the five closest neighbors to the top outlier, Record 12094, are examples 19439, 39246, 46432, 18924, and 22226.
Following the list of neighbors, each feature is listed along with a number representing its contribution to the score. Specifically, the number is how much the outlier score would drop if that feature were not included in the distance computation, assuming that the nearest neighbors stay the same.
For example, consider the third outlier, Record 1827. As the data was processed with DPrep, the ID number is the same as the line number in the original text file. Looking up her record in the adult data set we find:
1827: 22,Self-emp-not-inc,202920,HS-grad,9,Never-married,Prof-specialty,Unmarried,White,Female,99999,0,40,Dominican-Republic,>50K
That is, the outlier represents a 22-year-old, unmarried, white female who is a high school graduate, works in a professional specialty, and earns over $50K per year. We can also look up her nearest neighbors, which are
38025: 50,Self-emp-not-inc,203004,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,60,United-States,>50K
12534: 50,Self-emp-not-inc,155118,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,35,United-States,>50K
18127: 51,Self-emp-not-inc,111283,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,35,United-States,>50K
12910: 56,Self-emp-not-inc,163212,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,99999,0,40,United-States,>50K
20614: 30,Self-emp-not-inc,115932,HS-grad,9,Never-married,Craft-repair,Not-in-family,White,Male,99999,0,50,United-States,>50K
We can see that she is like her neighbors in many respects: they are all self-employed (not incorporated), high school graduates, earn more than $50K per year, and are white. Most of her neighbors are female (4 of 5). These factors all received low feature importance.
She differs substantially from her neighbors in four respects. First, her occupation is professional-specialty. Her neighbors are executive-managerial (3), administrative-clerical (1), or employed in craft-repair (1). Second, her native country is the Dominican Republic, whereas her neighbors are all natives of the United States. Third, she has never been married, whereas four of the five neighbors are divorced (3) or widowed (1). Finally, at age 22 she is much younger than her neighbors, who are aged 56, 51, 50, 50, and 30.
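As a worked check on the numbers for Record 1827, the listed importance of each discrete feature equals its weight (0.4) multiplied by the number of the five neighbors whose value differs. The Python sketch below recomputes this from the records quoted above. It is an observation about the printed output, not a description of Orca's internals; the column positions are an assumption based on the standard column order of the Adult data set, and the helper name `mismatch_contribution` is hypothetical.

```python
WEIGHT = 0.4  # weight of every discrete feature in the weight file above

# Assumed positions of the discrete features in the comma-delimited record.
COLUMNS = {"workclass": 1, "education": 3, "marital-status": 5,
           "occupation": 6, "relationship": 7, "race": 8, "sex": 9,
           "native-country": 13, "salary": 14}

outlier = ("22,Self-emp-not-inc,202920,HS-grad,9,Never-married,Prof-specialty,"
           "Unmarried,White,Female,99999,0,40,Dominican-Republic,>50K").split(",")

neighbors = [row.split(",") for row in (
    "50,Self-emp-not-inc,203004,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,60,United-States,>50K",
    "50,Self-emp-not-inc,155118,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,35,United-States,>50K",
    "51,Self-emp-not-inc,111283,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,35,United-States,>50K",
    "56,Self-emp-not-inc,163212,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,99999,0,40,United-States,>50K",
    "30,Self-emp-not-inc,115932,HS-grad,9,Never-married,Craft-repair,Not-in-family,White,Male,99999,0,50,United-States,>50K",
)]


def mismatch_contribution(feature):
    """Weight times the number of neighbors that differ on this feature."""
    idx = COLUMNS[feature]
    differing = sum(n[idx] != outlier[idx] for n in neighbors)
    return round(WEIGHT * differing, 2)


for feature in COLUMNS:
    print(f"{feature}: {mismatch_contribution(feature):.2f}")
```

All five neighbors differ on occupation and native-country (5 × 0.4 = 2.00), four differ on marital-status and relationship (4 × 0.4 = 1.60), and one differs on sex (0.40), matching the feature importance values listed for this record.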
Orca accepts many optional parameters, which are summarized in Table 1. A more detailed description of each option follows the table.
Outlier Options
  -avg     use the average distance to the k nearest neighbors
  -kth     use the distance to the kth nearest neighbor
  -n X     find the top X outliers
  -k X     use X nearest neighbors
  -c X     initial cutoff
Computation Options
  -b X     batch size
  -s X     starting batch size
Miscellaneous Options
  -rn      record nearest neighbors of the outliers
  -m X     use X to represent missing values
  -woff    ignore weights
-avg average distance
This option sets the score function to the average distance to the k nearest neighbors. This score function is the default.
-kth distance to kth neighbor
This option sets the score function to the distance to the kth nearest neighbor.
-n number of outliers
This option sets the number of outliers to return.
-k number of nearest neighbors
This option sets the number of nearest neighbors to use in the computation of outliers.
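The two score functions and the role of k can be sketched as follows. This is only an illustration in Python of the definitions given above: the function name `outlier_score` is hypothetical, the distances are assumed to be already computed against the reference examples, and any additional scaling Orca may apply to the reported score is not covered.

```python
def outlier_score(distances, k, use_kth=False):
    """Score one example from its distances to the reference examples."""
    nearest = sorted(distances)[:k]   # the k smallest distances
    if use_kth:
        return nearest[-1]            # -kth: distance to the kth nearest neighbor
    return sum(nearest) / len(nearest)  # -avg: mean distance to the k nearest
```

For distances 5, 1, 3, 2, 4 and k = 3, the three nearest neighbors are at distances 1, 2, and 3, so the average score is 2 and the kth-neighbor score is 3.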
-c X initial cutoff
The cutoff threshold is the minimum score an example must achieve to be an outlier. As the program executes and processes more examples from the test-file, the cutoff gradually increases as more unusual examples are discovered. By default, the initial cutoff is set to zero, which guarantees finding the top n outliers in the data set. Setting the initial cutoff greater than zero can greatly reduce the running time.
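The bookkeeping behind a rising cutoff can be sketched in Python as below. This is a simplified illustration, not Orca's search: it takes already-computed scores, whereas the running-time benefit in practice comes from abandoning an example early during the neighbor search. The function name `top_outliers` and the rule that the cutoff equals the weakest score among the current top n are assumptions made for the sketch.

```python
import heapq


def top_outliers(scores, n, initial_cutoff=0.0):
    """Keep the n highest-scoring examples, raising the cutoff as they are found.

    scores is an iterable of (example_id, score) pairs.
    """
    cutoff = initial_cutoff
    top = []  # min-heap of (score, example_id); top[0] is the weakest kept outlier
    for example_id, score in scores:
        if score < cutoff:
            continue  # below the current threshold: cannot be a top outlier
        heapq.heappush(top, (score, example_id))
        if len(top) > n:
            heapq.heappop(top)  # drop the weakest so only n remain
        if len(top) == n:
            # Once n outliers are known, a new one must beat the weakest of them.
            cutoff = max(cutoff, top[0][0])
    return sorted(top, reverse=True), cutoff
```

Starting from a cutoff of zero, every example is considered until n outliers have been collected, so the top n are always found. Starting from a higher cutoff skips more examples but can return fewer than n outliers if too few examples score above it.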
-b X batch size
The batch size is the number of examples that Orca loads into main memory from the test-file to process concurrently. The value can have a large effect on computation time, varying speed by about an order of magnitude, and setting it properly may require trial and error. A small value results in more frequent data accesses, which slows computation. Larger values result in fewer data accesses but can also slow computation because cache efficiency decreases. On a 1.5 GHz Pentium IV machine, I find about 1000 to be the best value.
-s X starting batch size
This option sets the number of examples to load into main memory on the first iteration. The first iteration is typically the most time-consuming.
-rn record neighbors
This option stores the ID numbers of the nearest neighbors of each outlier. With the nearest-neighbor information, Orca can calculate the contribution of each feature to the outlier score. This option slows computation and is off by default.
-m X missing values
This option sets the value X used to represent missing values. Missing values are represented by a specific floating-point value for continuous fields and a specific integer for categorical fields.
-woff ignore weights
Ignore the weights used in the distance function and treat all fields equally.
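When Orca is driven from a script, the options above can be assembled into an argument list. The Python helper below is a hypothetical convenience, not part of Orca: it only builds the list and does not run anything, it assumes that each option and its value are passed as separate arguments (as in "-n X"), and it covers only a subset of the options in Table 1.

```python
def orca_command(test_file, reference_file, weight_file, *, n=None, k=None,
                 cutoff=None, batch_size=None, kth=False,
                 record_neighbors=False, ignore_weights=False):
    """Build an argument list for the orca command line (does not run it)."""
    args = ["orca", test_file, reference_file, weight_file]
    if kth:
        args.append("-kth")  # otherwise the default -avg score is used
    # Options that take a value X.
    for flag, value in (("-n", n), ("-k", k), ("-c", cutoff), ("-b", batch_size)):
        if value is not None:
            args += [flag, str(value)]
    if record_neighbors:
        args.append("-rn")
    if ignore_weights:
        args.append("-woff")
    return args
```

For example, `orca_command("adult.bin", "adult.bin", "adult.weights", n=10, k=5, record_neighbors=True)` yields the arguments for `orca adult.bin adult.bin adult.weights -n 10 -k 5 -rn`, which could then be passed to `subprocess.run`.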
This software is Copyright 2003 by the Institute for the Study of Learning and Expertise. Orca may be freely used for educational and research purposes by non-profit institutions and U.S. Government agencies. Other organizations may use Orca for evaluation purposes only. All further uses require prior approval.
This software is provided "as is" with no warranties of any kind, either expressed or implied, including, but not limited to implied warranties as to the performance, fitness, and merchantability of the software for a particular purpose.
The entire risk of using the software is with the user. The software is provided without any support or obligation to assist with its use. This software may not be sold or redistributed without prior approval.