Using Orca

Orca is called from the command line with the following structure:

Orca examines each example in the test-file and determines whether it is an outlier by comparing it with the examples in the reference-file. The weight-file contains a list of the features and their weights. The test-file and reference-file can be the same, in which case Orca compares each example to all others except itself. The format of the files is described below.

Data File Format

Orca assumes that the test-file and reference-file are each stored as a binary file with the following format. #examples is an integer giving the total number of examples in the binary file. #real and #discrete are integers giving the number of real and discrete variables in each example. Each case is stored as follows

ID is a unique identifying integer for each case. Orca uses this ID number to reference outlier examples in its output. The ID variable is followed by the real variables and then the discrete variables.

There are no delimiters between fields; Orca keeps track of the element that it is accessing by its offset from the beginning of the file. In all cases, the ID and the discrete variables are encoded as 4-byte integers, and the real variables are encoded as 4-byte floating-point numbers.

In the binary data files, missing values are represented by a specific floating-point or integer value. By default, missing continuous values are encoded as -989898 and missing categorical values are encoded as -1.
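As a rough illustration of the layout described above, the following Python sketch writes and reads a file of this shape. It is not part of Orca or DPrep. The function names are hypothetical, and two details are assumptions rather than facts from this document: that the three header integers appear in the order #examples, #real, #discrete, and that values are little-endian.

```python
import struct

MISSING_REAL = -989898.0  # default missing marker for real fields (from the text)
MISSING_DISCRETE = -1     # default missing marker for discrete fields (from the text)


def pack_dataset(examples, n_real, n_discrete, byte_order="<"):
    """Serialize examples given as (id, reals, discretes) tuples.

    Assumed layout: three header ints, then for each case the ID (int),
    the real variables (4-byte floats), and the discrete variables (ints),
    with no delimiters between fields.
    """
    out = struct.pack(f"{byte_order}3i", len(examples), n_real, n_discrete)
    record = struct.Struct(f"{byte_order}i{n_real}f{n_discrete}i")
    for ex_id, reals, discretes in examples:
        if len(reals) != n_real or len(discretes) != n_discrete:
            raise ValueError("example does not match the declared field counts")
        out += record.pack(ex_id, *reals, *discretes)
    return out


def unpack_dataset(data, byte_order="<"):
    """Inverse of pack_dataset: returns (n_real, n_discrete, examples)."""
    header = struct.Struct(f"{byte_order}3i")
    n_examples, n_real, n_discrete = header.unpack_from(data, 0)
    record = struct.Struct(f"{byte_order}i{n_real}f{n_discrete}i")
    examples = []
    offset = header.size  # records are located purely by byte offset
    for _ in range(n_examples):
        fields = record.unpack_from(data, offset)
        reals = list(fields[1:1 + n_real])
        discretes = list(fields[1 + n_real:])
        examples.append((fields[0], reals, discretes))
        offset += record.size
    return n_real, n_discrete, examples
```

Because every field is 4 bytes wide, a case with r real and d discrete variables occupies 4 * (1 + r + d) bytes, which is what makes offset-based access possible without delimiters.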

These data files can be generated automatically with DPrep, which converts comma-delimited data sets stored in text format.

Weight Format

The weight file stores the name of each feature and the weight that the feature should have in the distance function. For example, a weight file for the Adult data set could appear as follows.

The features are listed one per line, and each line contains a text string representing the name and a number representing the weight.
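A weight file of this shape could be read with a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes that the name and the weight are separated by whitespace with the weight as the last token on the line, which this document does not state.

```python
def parse_weights(text):
    """Parse 'name weight' lines into a list of (name, weight) pairs.

    Assumption: fields are whitespace-separated and the weight is the
    last token, so a name may itself contain spaces.
    """
    weights = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        name, weight = line.rsplit(None, 1)
        weights.append((name, float(weight)))
    return weights
```

The order of the returned list follows the file, so it can be matched positionally against the fields of each example.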

The weights are used in the distance measure used to compare examples. For continuous features the distance measure is weighted Euclidean distance. For discrete features, the distance measure is weighted Hamming distance.
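The two distance measures can be sketched in Python as follows. The function names are hypothetical and this is not Orca's source code. In particular, this document does not say whether the weight is applied inside or outside the squared difference, or how the continuous and discrete parts are combined; the sketch assumes the weight multiplies the squared difference and that the two parts are simply added.

```python
import math


def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance over continuous features.

    Assumption: each weight multiplies the squared difference.
    """
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))


def weighted_hamming(a, b, w):
    """Weighted Hamming distance: sum the weights of mismatching features."""
    return sum(wi for ai, bi, wi in zip(a, b, w) if ai != bi)


def mixed_distance(x, y, w_real, a, b, w_discrete):
    """Assumed combination: continuous part plus discrete part."""
    return weighted_euclidean(x, y, w_real) + weighted_hamming(a, b, w_discrete)
```

Setting every weight to 1 reduces these to the ordinary Euclidean and Hamming distances.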

DPrep automatically generates a weight file when converting a data set into the proper binary format.

Output Format

Running Orca on the adult data set with the command

starts the search for outliers. Orca will produce a variety of log information, including its progress in processing the files, and a list of the outliers found which looks like

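The outliers are listed in descending order by their score. The first number is the rank of the outlier. The record number is the ID of the example. The score is the outlier score of the example under the chosen score function (by default, the average distance to its k nearest neighbors).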

If the -rn option is used, Orca keeps track of the nearest neighbors for each outlier and then computes the contribution of each feature to the outlier score. The output looks like

As before, the outliers are listed in descending order according to their score. The list of neighbors gives the IDs of the neighboring cases. For example, the five closest neighbors to the top outlier, Record 12094, are examples 19439, 39246, 46432, 18924, and 22226.

Following the list of neighbors, each feature is listed along with a number representing its contribution to the score. Specifically, the number is how much the outlier score would drop if that feature were not included in the distance computation, assuming that the nearest neighbors stay the same.
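The idea of a per-feature drop in score can be sketched in Python. This is an illustration under stated assumptions, not Orca's implementation: it handles continuous features only, uses the average-distance score, assumes the weight multiplies the squared difference, and all names are hypothetical.

```python
import math


def weighted_euclidean(x, y, w, skip=None):
    """Weighted Euclidean distance, optionally leaving out feature index `skip`."""
    total = 0.0
    for f, (xi, yi, wi) in enumerate(zip(x, y, w)):
        if f != skip:
            total += wi * (xi - yi) ** 2
    return math.sqrt(total)


def feature_contributions(outlier, neighbors, w):
    """For each feature, the drop in average distance to the fixed neighbors
    when that feature is removed from the distance computation."""
    def avg_distance(skip=None):
        return sum(weighted_euclidean(outlier, nb, w, skip) for nb in neighbors) / len(neighbors)

    full = avg_distance()
    return [full - avg_distance(skip=f) for f in range(len(outlier))]
```

The neighbor set is held fixed throughout, matching the assumption in the text that the nearest neighbors stay the same when a feature is removed.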

For example, consider the third outlier, Record 1827. As the data was processed with DPrep, the ID number is the same as the line number in the original text file. Looking up her record in the adult data set we find:

That is, the outlier represents a 22-year-old, unmarried, white female who is a high school graduate working in a professional specialty and earning over 50K per year. We can also look up her nearest neighbors, which are

We can see that she is like her neighbors in many respects: they are all self-employed (not incorporated), high school graduates, white, and earning more than $50K per year. Most of her neighbors are female (4 of 5). These factors all received low feature importance.

She differs substantially from her neighbors in four respects. First, her occupation is professional-specialty, whereas her neighbors are executive-managerial (3), administrative-clerical (1), or employed in craft-repair (1). Second, her native country is the Dominican Republic, whereas her neighbors are all natives of the United-States. Third, she has never been married, whereas four of the five neighbors are divorced (3) or widowed (1). Finally, at age 22 she is much younger than her neighbors, who are aged 56, 51, 50, 50, and 30.

Options

There are many optional parameters that can be used with Orca; these are summarized in Table 1. A more detailed description of each option follows the table.

Outlier Options
 -avg    use the average distance to the k nearest neighbors
 -kth    use the distance to the kth nearest neighbor
 -n X    find the top X outliers
 -k X    use X nearest neighbors
 -c X    initial cutoff
Computation Options
 -b X    batch size
 -s X    starting batch size
Miscellaneous Options
 -rn     record nearest neighbors of the outliers
 -m X    use X to represent missing values
 -woff   ignore weights

 

-avg   average distance

This option sets the score function to the average distance to the k nearest neighbors. This score function is the default.

-kth   distance to kth neighbor

This option sets the score function to the distance to the kth nearest neighbor.
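The two score functions differ only in how they summarize the k smallest distances. A minimal Python sketch, with hypothetical function names and assuming the distances from one example to all reference examples have already been computed:

```python
def avg_knn_score(distances, k):
    """Score used by -avg: mean distance to the k nearest neighbors."""
    nearest = sorted(distances)[:k]
    return sum(nearest) / len(nearest)


def kth_nn_score(distances, k):
    """Score used by -kth: distance to the kth nearest neighbor (k counts from 1)."""
    return sorted(distances)[k - 1]
```

For distances 4, 1, 3, 2 and k = 3, the average score is (1 + 2 + 3) / 3 = 2 while the kth-neighbor score is 3.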

-n X   number of outliers

This option sets the number of outliers to return.

-k X   number of nearest neighbors

This option sets the number of nearest neighbors to use in the computation of outliers.

-c X   initial cutoff

The cutoff threshold is the minimum score an example must achieve to be an outlier. As the program executes and processes more examples from the test-file, the cutoff gradually increases as more unusual examples are discovered. By default, the initial cutoff is set to zero, which guarantees finding the top n outliers in the data set. Setting the initial cutoff greater than zero can greatly reduce the running time.
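The role of the cutoff can be illustrated with a simplified nested-loop search in Python. This is a sketch of the general pruning idea, not Orca's actual code: it ignores batching, uses the average-distance score, and all names are hypothetical. The key observation is that an example's running score can only decrease as more reference examples are seen, so the example can be abandoned as soon as the running score falls below the cutoff.

```python
import heapq


def top_outliers(points, n, k, dist, cutoff=0.0):
    """Return up to n (score, index) pairs, highest score first."""
    top = []  # min-heap of (score, index) holding the best n found so far
    for i, p in enumerate(points):
        nearest = []  # negated distances, so nearest[0] is the largest of the k kept
        pruned = False
        for j, q in enumerate(points):
            if i == j:
                continue  # an example is not compared with itself
            d = dist(p, q)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The running average is an upper bound on the final score.
            if len(nearest) == k and -sum(nearest) / k < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        heapq.heappush(top, (-sum(nearest) / k, i))
        if len(top) > n:
            heapq.heappop(top)  # drop the weakest candidate
        if len(top) == n:
            cutoff = max(cutoff, top[0][0])  # cutoff rises to the weakest kept score
    return sorted(top, reverse=True)
```

In this sketch, a larger initial cutoff lets ordinary examples be discarded after only a few comparisons, but any example scoring below that cutoff is never reported, so fewer than n outliers may be returned.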

-b X   Batch Size

The batch size is the number of examples that Orca loads into main memory from the test-file to process concurrently. The value for batch size can have a large effect on computation time, varying speed by about an order of magnitude, and setting it properly may require trial and error. A small value results in more frequent data accesses, which slows computation. Larger values result in fewer data accesses but can also slow computation because cache efficiency decreases. On a Pentium IV 1.5 GHz machine, I find about 1000 to be the best value.

-s X   Starting Batch Size

This option sets the number of examples to load into main memory on the first iteration. The first iteration is typically the most time-consuming.

-rn   Record Neighbors

This option turns on storage of the ID numbers of the nearest neighbors of each outlier. With the nearest neighbor information, Orca can calculate the contribution of each feature to the distance score. This option slows computation and is off by default.

-m X   missing values

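This option sets the value X used to represent missing values. Missing values are represented by a specific floating-point value for continuous fields and a specific integer for categorical fields; the defaults are -989898 and -1, respectively.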

-woff   ignore weights

Ignore the weights used in the distance function and treat all fields equally.

Copyright and Usage

This software is Copyright 2003 by the Institute for the Study of Learning and Expertise. Orca may be freely used for educational and research purposes by non-profit institutions and U.S. Government agencies. Other organizations may use Orca for evaluation purposes only. All further uses require prior approval.

This software is provided "as is" with no warranties of any kind, either expressed or implied, including, but not limited to implied warranties as to the performance, fitness, and merchantability of the software for a particular purpose.

The entire risk of using the software is with the user. The software is provided without any support or obligation to assist with its use. This software may not be sold or redistributed without prior approval.
 


2003-5-6