Using Orca

Orca is called from the command line with the following structure:

  orca test-file reference-file weight-file [options]

Orca examines each example in the test-file and determines if it is an outlier by comparing it with the examples in the reference-file. The weight-file contains a list of the features and their weights. The test-file and reference-file can be the same, in which case, Orca compares each example to all others except itself. The format of the files is described below.

Data File Format

Orca assumes that the test-file and reference-file are each stored as a binary file with the following format.

#examples #real #discrete case1 case2 ... caseN

#examples is an integer giving the total number of examples in the binary file. #real and #discrete are integers giving the number of real and discrete variables for each example. Each case is stored as follows:

ID R1 R2 ... Rm D1 D2 ... Dp

ID is a unique identifying integer for each case. Orca uses this ID number to reference outlier examples in its output. The ID variable is followed by the real variables and then the discrete variables.

There are no delimiters between fields; Orca keeps track of the element it is accessing by its offset from the beginning of the file. The ID and discrete variables are encoded as 4-byte integers, and the real variables as 4-byte floating point numbers.

In the binary data files, missing values are represented by a specific floating point or integer value. By default, missing continuous values are encoded as -989898 and missing categorical values are encoded as -1.

These data files can be automatically generated with DPrep which converts comma delimited data sets stored in text format.
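DPrep remains the intended way to produce these files. Purely to illustrate the layout described above, here is a minimal Python sketch that packs and unpacks a data set in that layout. The function names are hypothetical, and the byte order is an assumption (little-endian), since the format description does not state one.

```python
import struct

MISSING_REAL = -989898.0   # default missing marker for real (continuous) fields
MISSING_DISCRETE = -1      # default missing marker for discrete (categorical) fields
BYTE_ORDER = "<"           # ASSUMPTION: little-endian; the manual does not specify byte order


def pack_dataset(cases, n_real, n_discrete):
    """Pack cases, given as (id, reals, discretes) tuples, into the binary layout.

    A value of None in reals or discretes is written as the missing marker.
    """
    # Header: #examples #real #discrete, each a 4-byte integer.
    header = struct.pack(BYTE_ORDER + "3i", len(cases), n_real, n_discrete)
    # One case: 4-byte int ID, n_real 4-byte floats, n_discrete 4-byte ints.
    record = struct.Struct(f"{BYTE_ORDER}i{n_real}f{n_discrete}i")
    body = b"".join(
        record.pack(
            case_id,
            *[MISSING_REAL if r is None else r for r in reals],
            *[MISSING_DISCRETE if d is None else d for d in discretes],
        )
        for case_id, reals, discretes in cases
    )
    return header + body


def unpack_dataset(data):
    """Inverse of pack_dataset: return (n_real, n_discrete, cases)."""
    n_cases, n_real, n_discrete = struct.unpack_from(BYTE_ORDER + "3i", data, 0)
    record = struct.Struct(f"{BYTE_ORDER}i{n_real}f{n_discrete}i")
    cases = []
    offset = 12  # the header occupies three 4-byte integers
    for _ in range(n_cases):
        fields = record.unpack_from(data, offset)
        cases.append((fields[0], list(fields[1:1 + n_real]), list(fields[1 + n_real:])))
        offset += record.size
    return n_real, n_discrete, cases
```

Because there are no delimiters, the position of any field follows from the header alone: case i starts at byte 12 + i * 4 * (1 + #real + #discrete).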

Weight Format

The weight file stores the name of each feature and the weight that the feature should have in the distance function. For example, a weight file for the Adult data set could appear as follows.

age 1.0
capital-gain 1.0
capital-loss 1.0
hours-per-week 1.0
workclass 0.4
education 0.4
marital-status 0.4
occupation 0.4
relationship 0.4
race 0.4
native-country 0.4
salary 0.4
sex 0.4

The features are listed one per line, and each line contains a text string representing the name and a number representing the weight.

The weights are used in the distance measure used to compare examples. For continuous features the distance measure is weighted Euclidean distance. For discrete features, the distance measure is weighted Hamming distance.
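As an illustration only, the following Python sketch parses a weight file and computes a weighted distance between two examples. The function names are hypothetical. Two details are assumptions, because the text above does not state them: that each weight multiplies the squared difference inside the Euclidean part, and that the Euclidean and Hamming parts are simply added together.

```python
import math


def parse_weights(text):
    """Parse 'name weight' lines into an ordered list of (name, weight) pairs."""
    pairs = []
    for line in text.splitlines():
        if line.strip():                  # skip blank lines
            name, weight = line.split()
            pairs.append((name, float(weight)))
    return pairs


def weighted_distance(a_real, b_real, w_real, a_disc, b_disc, w_disc):
    """Distance between two examples with real and discrete parts."""
    # Weighted Euclidean part.
    # ASSUMPTION: the weight multiplies the squared difference.
    euclidean = math.sqrt(
        sum(w * (x - y) ** 2 for x, y, w in zip(a_real, b_real, w_real))
    )
    # Weighted Hamming part: each mismatching discrete field adds its weight.
    hamming = sum(w for x, y, w in zip(a_disc, b_disc, w_disc) if x != y)
    # ASSUMPTION: the two parts are added; Orca may combine them differently.
    return euclidean + hamming
```

For instance, two examples that differ by (3, 4) on two real features of weight 1.0 and mismatch on one discrete feature of weight 0.4 would be 5.4 apart under these assumptions.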

DPrep automatically generates a weight file when converting a data set into the proper binary format.

Output Format

Running Orca on the adult data set with the command

> orca adult.bin adult.bin adult.weights

starts the search for outliers. Orca will produce a variety of log information, including its progress in processing the files, and a list of the outliers found which looks like

Top outliers:

  1. Record: 12094 Score: 8.92967

  2. Record: 17539 Score: 8.71359

  3. Record: 1827 Score: 8.33826

  4. Record: 39783 Score: 8.04264

  5. Record: 16261 Score: 8.02936

  6. Record: 4657 Score: 8.02299

  7. Record: 42601 Score: 7.75865

The outliers are listed in descending order by their score. The first number is the rank of the outlier. The record number is the ID of the example. The score is the outlier score computed by the selected score function (by default, the average distance to the k nearest neighbors).
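If the list needs to be processed further, lines of this form can be picked out of the log with a regular expression. The sketch below is one way to do that in Python; the function name is hypothetical, and it assumes the lines look exactly like the listing above.

```python
import re

# Matches lines such as "  1. Record: 12094 Score: 8.92967".
OUTLIER_LINE = re.compile(
    r"^\s*(\d+)\.\s+Record:\s+(\d+)\s+Score:\s+([-+0-9.eE]+)\s*$"
)


def parse_outliers(log_text):
    """Return a list of (rank, record_id, score) tuples found in Orca's log."""
    outliers = []
    for line in log_text.splitlines():
        match = OUTLIER_LINE.match(line)
        if match:                         # all other log lines are ignored
            rank, record_id, score = match.groups()
            outliers.append((int(rank), int(record_id), float(score)))
    return outliers
```

Lines that do not match, such as progress messages or the "Top outliers:" header, are skipped.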

If the -rn option is used, Orca keeps track of the nearest neighbors for each outlier and then computes the contribution of each feature to the outlier score. The output looks like

Top outliers:

  1. Record: 12094 Score: 8.92967
     Neighbors: 19439 39246 46432 18924 22226 
     feature importance: 
       capital-gain: 3.74
       native-country: 1.60
       race: 0.80
       salary: 0.80
       relationship: 0.80
       education: 0.40
       marital-status: 0.40
       age: 0.35
       hours-per-week: 0.03
       workclass: 0.00
       sex: 0.00
       occupation: 0.00
       capital-loss: 0.00

  2. Record: 17539 Score: 8.71359
     Neighbors: 8618 47830 37344 32371 45612 
     feature importance: 
       capital-gain: 2.52
       native-country: 2.00
       workclass: 1.20
       sex: 1.20
       race: 0.80
       relationship: 0.40
       marital-status: 0.40
       age: 0.14
       hours-per-week: 0.05
       salary: 0.00
       occupation: 0.00
       education: 0.00
       capital-loss: 0.00

  3. Record: 1827 Score: 8.33826
     Neighbors: 38025 12534 18127 12910 20614 
     feature importance: 
       occupation: 2.00
       native-country: 2.00
       marital-status: 1.60
       relationship: 1.60
       age: 0.68
       sex: 0.40
       hours-per-week: 0.06
       workclass: 0.00
       salary: 0.00
       race: 0.00
       education: 0.00
       capital-loss: 0.00
       capital-gain: 0.00

As before, the outliers are listed in descending order according to their score. The Neighbors line lists the IDs of the neighboring cases. For example, the five closest neighbors to the top outlier, Record 12094, are examples 19439, 39246, 46432, 18924, and 22226.

Following the list of neighbors, each feature is listed along with a number representing its contribution to the score. Specifically, the number is how much the outlier score would drop if that feature were not included in the distance score, assuming that the nearest neighbors stay the same.
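As a quick consistency check on the listing above, the per-feature contributions of each record add up to its reported score, to within the rounding of the printed values. The short Python sketch below does the arithmetic using only numbers copied from the listing (zero contributions are omitted); the function name is hypothetical.

```python
# record ID -> (reported score, nonzero feature contributions from the listing)
LISTED = {
    12094: (8.92967, [3.74, 1.60, 0.80, 0.80, 0.80, 0.40, 0.40, 0.35, 0.03]),
    17539: (8.71359, [2.52, 2.00, 1.20, 1.20, 0.80, 0.40, 0.40, 0.14, 0.05]),
    1827: (8.33826, [2.00, 2.00, 1.60, 1.60, 0.68, 0.40, 0.06]),
}


def contribution_totals(listed):
    """Sum the printed contributions for each record, rounded to two decimals."""
    return {record: round(sum(parts), 2) for record, (_, parts) in listed.items()}


for record, total in contribution_totals(LISTED).items():
    score = LISTED[record][0]
    print(f"Record {record}: contributions sum to {total:.2f}, score is {score}")
```

The sums come to 8.92, 8.71, and 8.34, matching the three scores to two decimal places.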

For example, consider the third outlier, Record 1827. Because the data was processed with DPrep, the ID number is the same as the line number in the original text file. Looking up her record in the Adult data set, we find:

1827:  22,Self-emp-not-inc,202920,HS-grad,9,Never-married,Prof-specialty,Unmarried,White,Female,99999,0,40,Dominican-Republic,>50K

That is, the outlier represents a 22-year-old, unmarried, white female who is a high-school graduate, works in a professional specialty, and earns over $50K per year. We can also look up her nearest neighbors, which are

38025:  50,Self-emp-not-inc,203004,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,60,United-States,>50K
12534:  50,Self-emp-not-inc,155118,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,35,United-States,>50K
18127:  51,Self-emp-not-inc,111283,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,35,United-States,>50K
12910:  56,Self-emp-not-inc,163212,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,99999,0,40,United-States,>50K
20614:  30,Self-emp-not-inc,115932,HS-grad,9,Never-married,Craft-repair,Not-in-family,White,Male,99999,0,50,United-States,>50K

We can see that she is like her neighbors in many respects: they are all self-employed (not incorporated), high-school graduates, earn more than $50K per year, and are white. Most of her neighbors are female (4 of 5). These factors all received low feature importance.

She differs substantially from her neighbors in four respects. First, her occupation is professional-specialty, whereas her neighbors are executive-managerial (3), administrative-clerical (1), or employed in craft-repair (1). Second, her native country is the Dominican Republic, whereas her neighbors are all natives of the United States. Third, she has never been married, whereas four of the five neighbors are divorced (3) or widowed (1). Finally, at age 22 she is much younger than her neighbors, who are aged 56, 51, 50, 50, and 30.
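The discrete feature importances listed for Record 1827 can be reproduced from the records shown above: each one equals the feature's weight (0.4) multiplied by the number of the five neighbors that differ from her on that feature. The Python sketch below checks this. It is an observation about this one listing, not a description of Orca's implementation, and the column order is an assumption based on the standard layout of the Adult text file, which this document does not list.

```python
# ASSUMPTION: standard column order of the Adult text file.
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex",
           "capital-gain", "capital-loss", "hours-per-week",
           "native-country", "salary"]
# Discrete features and their shared weight, from the example weight file.
DISCRETE = ["workclass", "education", "marital-status", "occupation",
            "relationship", "race", "native-country", "salary", "sex"]
WEIGHT = 0.4

OUTLIER = "22,Self-emp-not-inc,202920,HS-grad,9,Never-married,Prof-specialty,Unmarried,White,Female,99999,0,40,Dominican-Republic,>50K".split(",")
NEIGHBORS = [row.split(",") for row in [
    "50,Self-emp-not-inc,203004,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,60,United-States,>50K",
    "50,Self-emp-not-inc,155118,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,35,United-States,>50K",
    "51,Self-emp-not-inc,111283,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Female,99999,0,35,United-States,>50K",
    "56,Self-emp-not-inc,163212,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,99999,0,40,United-States,>50K",
    "30,Self-emp-not-inc,115932,HS-grad,9,Never-married,Craft-repair,Not-in-family,White,Male,99999,0,50,United-States,>50K",
]]


def discrete_contributions(outlier, neighbors):
    """Weight times the number of neighbors that mismatch, per discrete feature."""
    result = {}
    for feature in DISCRETE:
        col = COLUMNS.index(feature)
        mismatches = sum(1 for n in neighbors if n[col] != outlier[col])
        result[feature] = round(WEIGHT * mismatches, 2)
    return result


for feature, value in discrete_contributions(OUTLIER, NEIGHBORS).items():
    print(f"{feature}: {value:.2f}")
```

This gives 2.00 for occupation and native-country (all five neighbors differ), 1.60 for marital-status and relationship (four differ), 0.40 for sex (one differs), and 0.00 for the rest, in agreement with the listing.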

Options

There are many optional parameters that can be used with Orca, and these are summarized in Table 1 below. Following the table is a more detailed description of each option.

Outlier Options
	-avg	use the average distance to the k nearest neighbors
	-kth	use the distance to the kth nearest neighbor
	-n X	find the top X outliers
	-k X	use X nearest neighbors
	-c X	initial cutoff
Computation Options
	-b X	batch size
	-s X	starting batch size
Miscellaneous Options
	-rn	record nearest neighbors of the outliers
	-m X	use X to represent missing values
	-woff	ignore weights

-avg average distance

This option sets the score function to the average distance to the k nearest neighbors. This score function is the default.

-kth distance to kth neighbor

This option sets the score function to the distance to the kth nearest neighbor.
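The difference between the two score functions can be sketched in a few lines of Python. The function names are hypothetical, and the input is assumed to be the distances from one example to its k nearest neighbors, already computed; how Orca scales the score it prints is not specified here.

```python
def score_avg(neighbor_distances):
    """-avg: average distance to the k nearest neighbors."""
    return sum(neighbor_distances) / len(neighbor_distances)


def score_kth(neighbor_distances):
    """-kth: distance to the kth (farthest of the k) nearest neighbor."""
    return max(neighbor_distances)


# Hypothetical distances from one example to its k = 5 nearest neighbors.
distances = [1.0, 1.5, 2.0, 2.5, 8.0]
print(score_avg(distances), score_kth(distances))
```

In this made-up case the average is 3.0 and the kth-neighbor distance is 8.0, which shows that -kth depends only on the single farthest of the k neighbors while -avg takes all of them into account.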

-n X number of outliers

This option sets the number of outliers to return.

-k X number of nearest neighbors

This option sets the number of nearest neighbors to use in the computation of outliers.

-c X initial cutoff

The cutoff threshold is the minimum score an example must achieve to be an outlier. As the program executes and processes more examples from the test-file the cutoff gradually increases as more unusual examples are discovered. By default, the initial cutoff is set to zero and this guarantees finding the top n outliers in the data set. Setting the initial cutoff greater than zero can greatly reduce the running time.
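The bookkeeping behind a rising cutoff can be sketched as follows. This is a simplified illustration in Python, not Orca's implementation: it assumes each example's score is already known, whereas the time saving presumably comes from abandoning an example early once it cannot reach the cutoff. The function name is hypothetical.

```python
import heapq


def top_outliers(scored_examples, n, initial_cutoff=0.0):
    """Keep the n highest-scoring examples, raising the cutoff as they are found.

    scored_examples is an iterable of (record_id, score) pairs.
    Returns (outliers sorted by descending score, final cutoff).
    """
    heap = []                 # min-heap of (score, record_id): the current top n
    cutoff = initial_cutoff
    for record_id, score in scored_examples:
        if score < cutoff:    # cannot be a top outlier; skip it
            continue
        heapq.heappush(heap, (score, record_id))
        if len(heap) > n:
            heapq.heappop(heap)               # drop the weakest candidate
        if len(heap) == n:
            cutoff = max(cutoff, heap[0][0])  # weakest of the current top n
    return sorted(heap, reverse=True), cutoff
```

With an initial cutoff of zero nothing is excluded up front, so the top n are always found. With a larger initial cutoff, examples scoring below it are never considered, so fewer than n outliers may be returned.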

-b X batch size

The batch size is the number of examples that Orca loads into main memory from the test-file to process concurrently. The value for batch size can have a large effect on computation time, varying speed by about an order of magnitude, and setting it properly may require trial and error. A small value results in more frequent data accesses, which slows down computation. Larger values result in fewer data accesses but can result in slower computation times because cache efficiency decreases. On a Pentium IV 1.5 GHz machine I find about 1000 to be the best value.
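The trial and error can be automated by timing the same run with several batch sizes. The Python sketch below is one way to do it; the function name and the candidate sizes are hypothetical, and it assumes an executable named orca is on the PATH and that the option is passed as two tokens, "-b" followed by the number.

```python
import subprocess
import time


def time_batch_sizes(base_command, batch_sizes, run=subprocess.run):
    """Run base_command once per batch size and return {batch_size: seconds}.

    run is injectable so the function can be tried without Orca installed.
    """
    timings = {}
    for size in batch_sizes:
        command = list(base_command) + ["-b", str(size)]
        start = time.perf_counter()
        run(command)
        timings[size] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # ASSUMPTION: 'orca' is on the PATH and the Adult files exist in this directory.
    base = ["orca", "adult.bin", "adult.bin", "adult.weights"]
    for size, seconds in time_batch_sizes(base, [100, 1000, 10000]).items():
        print(f"-b {size}: {seconds:.1f} s")
```

Timings on one machine and data set will not necessarily carry over to another, so the comparison is best repeated on the target hardware.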

-s X starting batch size

This option sets the number of examples to load into main memory on the first iteration. The first iteration is typically the most time consuming.

-rn record neighbors

This option turns on storing the ID numbers of the nearest neighbors of each outlier. With the nearest neighbor information we can calculate the contribution of each feature to the distance score. This option slows computation and is off by default.

-m X missing values

Missing values are represented by a specific floating point value for continuous fields and a specific integer for categorical fields. This option sets the value used to represent missing values to X.

-woff ignore weights

Ignore the weights used in the distance function and treat all fields equally.
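When Orca is driven from a script, the command line can be assembled from the options above. The following Python sketch builds the argument list for a few of them. The function name is hypothetical, and it assumes an executable named orca is on the PATH and that options taking a value are passed as two separate tokens, as the table suggests.

```python
import subprocess


def orca_command(test_file, reference_file, weight_file,
                 n=None, k=None, kth=False, record_neighbors=False):
    """Build the argument list for an Orca run from a few documented options."""
    command = ["orca", test_file, reference_file, weight_file]
    if n is not None:
        command += ["-n", str(n)]      # find the top n outliers
    if k is not None:
        command += ["-k", str(k)]      # use k nearest neighbors
    if kth:
        command.append("-kth")         # score by distance to the kth neighbor
    if record_neighbors:
        command.append("-rn")          # record neighbors of each outlier
    return command


if __name__ == "__main__":
    cmd = orca_command("adult.bin", "adult.bin", "adult.weights",
                       n=30, k=5, record_neighbors=True)
    # ASSUMPTION: 'orca' is installed; the log is captured as text.
    completed = subprocess.run(cmd, capture_output=True, text=True)
    print(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.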

Copyright and Usage

This software is Copyright 2003 by the Institute for the Study of Learning and Expertise. Orca may be freely used for educational and research purposes by non-profit institutions and U.S. Government agencies. Other organizations may use Orca for evaluation purposes only. All further uses require prior approval.

This software is provided "as is" with no warranties of any kind, either expressed or implied, including, but not limited to implied warranties as to the performance, fitness, and merchantability of the software for a particular purpose.

The entire risk of using the software is with the user. The software is provided without any support or obligation to assist with its use. This software may not be sold or redistributed without prior approval.

2003-5-6