Analysis and Classification

Spectra Detect Worker analyzes files submitted via the Worker API and produces a detailed analysis report for every file using the built-in Spectra Core static analysis engine.

Analysis reports can be retrieved in several ways, depending on the Worker configuration. It is also possible to control the contents of the report to an extent.

The format of the report is described in a separate document: Spectra Detect Analysis Schema.

Retrieving Analysis Reports

There are two ways to get the file analysis report(s):

  1. The Get information about a processing task endpoint. Sending a GET request to the endpoint with the task ID returns the analysis report in the response.

  2. One of the configured integrations, where the report is saved:

    • S3 - for hosted deployments, this is the only supported integration

    • Microsoft services (Azure Storage, SharePoint, OneDrive)

    • file shares (NFS, SMB)

    • Splunk

    • callback server

Adding Custom Data to the Report

Users can also save any custom data in the analysis report by submitting it in the file upload request.

The custom_data field accepts any user-defined data as a JSON-encoded payload. This data is included in all file analysis reports (Worker API, Callback, AWS S3, Azure Data Lake and Splunk, if enabled) in its original, unchanged form. The custom_data field will not be returned in the Get information about a processing task endpoint response if the file has not been processed yet.

Users should avoid using request_id as a key in their custom_data, as that value is used internally by the appliance.

Example - Submitting a file with the custom_data parameter to store user-defined information in the report

curl -X POST \
   https://tiscale-worker-01/api/tiscale/v1/upload \
   -H 'Authorization: Token 94a269285acbcc4b37a0ad335d221fab804a1d26' \
   -F file=@Classification.pdf \
   -F 'custom_data={
            "file_source" : {
               "uploader" : "malware_analyst",
               "origin" : "china"
            }}'
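The custom_data value can also be prepared programmatically before it is attached to the upload request. The following is a minimal Python sketch, not part of the Worker API: the helper name build_custom_data is hypothetical, and it only checks top-level keys for the reserved request_id name before serializing the data to JSON.

```python
import json

# Key used internally by the appliance, per the note above.
RESERVED_KEYS = {"request_id"}


def build_custom_data(data: dict) -> str:
    """Return a JSON string suitable for the custom_data form field."""
    clashes = RESERVED_KEYS.intersection(data)
    if clashes:
        raise ValueError(f"reserved key(s) in custom_data: {sorted(clashes)}")
    return json.dumps(data)


payload = build_custom_data(
    {"file_source": {"uploader": "malware_analyst", "origin": "china"}}
)
print(payload)
```

The resulting string is what would be passed as the custom_data form field in the upload request.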

Customizing Analysis Reports

There are several different ways of customizing an analysis report:

  1. through report configuration

  2. through report types

  3. through report views

These methods are not mutually exclusive and are applied in the order above (configuration first, then report type, then report view). A field must survive each earlier step to be available to a later one. For example, strings found in a file must be included in the report by the configuration before a report type or report view can filter or transform them.

Report types are results of filtering the full report. In other words, fields can be included or excluded as required. On the other hand, report views are results of transforming parts of the report, such as field names or the structure of the report. Historically, views could also be used to just filter out certain fields without any transformations, and this functionality has been maintained for backward compatibility. However, filtering-only views should be replaced by their equivalent report types as they are much faster.

As previously mentioned, filtering and transforming actions are not mutually exclusive. You can filter out some fields (using a report type), and then perform a transformation on what remains (using a report view). However, not all report views are compatible with all report types. This is because some report views expect certain fields to be present.

Report Types

Report types are JSON configuration files with the following format:

{
   "name": "string",
   "exclude_fields": true,
   "fields": {
      "example_field": false,
      "another_example": {
         "example_subfield": false,
         "another_subfield": false
      }
   }
}

small

Contains only the classification of the file, and some information about the file.

extended_small

Contains information about file classification, information about the file, the story field, tags and interesting_strings.

medium

This is the default report that’s served when there are no additional query parameters (in other words, it’s not necessary to specifically request this report, as it’s sent by default). It is equivalent to the previous “summary” report with some small differences:

  • each subreport contains an index and parent field

  • if metadata.application.capabilities is 0, then this field is not present in the report

Changes in this report:

  • excludes the entire relationships section

  • excludes certain fields under the info section, such as warnings and errors

  • many metadata fields are not present such as those related to certificates

  • there are no strings, no story and no tags

large

Includes every single field present in the analysis report. It is equivalent to the previous “full” report (?full=true).


Report types that replace report views with the same name:

classification

This report returns only the classification of the file, story and info. It has no metadata except the attack field.

classification_tags

Same as the classification report type, with the addition of Spectra Core tags.

extended

Compared to the default (medium) type, contains:

  • all metadata, relationships, tags

  • the story field

  • under info, contains statistics and unpacking information

mobile_detections

Contains mobile-related metadata, as well as classification and story.

mobile_detections_v2

Contains more narrowly defined mobile metadata, with exclusive focus on Android. Also contains classification and story.

short_cert

Contains certificate and signature-related metadata, as well as indicators and some classification info.

The name of the report type is the string you’ll refer to when calling the Get information about a processing task endpoint (or the one passed to the relevant configuration command). For example, if the name of your report type is my-custom-report-type, you would include it in the query parameters as follows: ?report_type=my-custom-report-type.
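As a small illustration of the query parameter, the Python sketch below appends report_type to a task URL. The URL used here is a placeholder, not a real Worker path; substitute the task link returned by your own Worker. The helper name report_url is hypothetical.

```python
from urllib.parse import urlencode


def report_url(task_url: str, report_type: str) -> str:
    """Append the report_type query parameter to a task URL."""
    return f"{task_url}?{urlencode({'report_type': report_type})}"


# Placeholder URL for illustration only.
url = report_url("https://worker.example/task/1", "my-custom-report-type")
print(url)
```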

The exclude_fields field defines the behavior of report filtering. This is an optional field and is false by default. This means that, by default, the report fields under fields will be included (you explicitly say which fields you want). Conversely, if this value is set to true, then the report fields under fields will be excluded (you explicitly say which fields you don’t want).

The fields nested dictionary contains the fields that are either included or excluded (depending on the value of exclude_fields). If a subfield is set to a boolean value (true/false), then the inclusion/exclusion applies to that section and all sections under it.

For example:

small.json
{
  "name": "small",
  "fields": {
    "info" : {
      "file": true,
      "identification": true
    },
    "classification" : true
  }
}

In this configuration, we’re explicitly including fields (exclude_fields was not set, so it’s false by default). Setting individual fields to true will make them (and their subfields) appear in the final report. In other words, the only sections that will be in the final report are the entire classification section and the file and identification fields from the info section. Everything else will not be present.

Or:

exclude-example.json
{
  "name": "exclude-example",
  "exclude_fields": true,
  "fields": {
    "relationships": false,
    "info": {
      "statistics": false,
      "binary_layer": false
    }
  }
}

In this configuration, the entire relationships section is excluded, as well as statistics and binary_layer from the info section. Everything else will be present in the report.
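The effect of an exclusion-style report type can be approximated with a short Python sketch. This is not the Worker's implementation, only an illustration of the behavior described above: any section marked with a boolean in fields is dropped together with everything under it, and everything else is kept. The function name apply_exclusions and the sample report are hypothetical.

```python
def apply_exclusions(report: dict, fields: dict) -> dict:
    """Drop every section that the fields mapping marks with a boolean."""
    result = {}
    for key, value in report.items():
        rule = fields.get(key)
        if isinstance(rule, bool):
            continue  # whole section (and everything under it) is excluded
        if isinstance(rule, dict) and isinstance(value, dict):
            result[key] = apply_exclusions(value, rule)  # descend into subfields
        else:
            result[key] = value  # not mentioned in the config: keep as-is
    return result


report_type = {
    "name": "exclude-example",
    "exclude_fields": True,
    "fields": {
        "relationships": False,
        "info": {"statistics": False, "binary_layer": False},
    },
}
# Hypothetical, heavily shortened report used only for this example.
sample = {
    "info": {"file": {"size": 35}, "statistics": {}, "binary_layer": {}},
    "classification": {"classification": 0},
    "relationships": {},
}
filtered = apply_exclusions(sample, report_type["fields"])
print(filtered)
```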

Limitations

  • info.file cannot be excluded and is always present.

  • Items in arrays cannot be selectively included or excluded (entire arrays only).

  • Items that are JSON primitives (string, number, boolean, null) cannot be excluded if they’re on the same level as an included field, or above an included field. Take this example structure:

    Example structure (only e has been explicitly included)
    {
      "a": {
        "b": 1,
        "c": {
          "d": "foo",
          "e": "bar",
          "f": {
            "g": "text",
            "h": 1
           }
        },
        "x": [
           1,
           2,
           3
        ],
        "y": {
          "z": "hello",
          "w": "world"
        }
      }
    }
    

    If you include e, you will also get d (because it’s on the same level as e, and is a primitive data type), but you will also get b (because it’s on the level above, and is a primitive data type as well):

    Filtering result
    {
      "a": {
        "b": 1,
        "c": {
          "d": "foo",
          "e": "bar"
        }
      }
    }
    

    However, you will not get f, x or y as they are non-primitives (objects and arrays).
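The primitive-retention behavior above can be reproduced with a short Python sketch. It is an approximation written to match the example, not the Worker's actual filtering code, and the function name apply_inclusions is hypothetical. Included fields are kept whole, primitives on the path to an included field are kept, and objects or arrays that were not included are dropped.

```python
def apply_inclusions(data: dict, fields: dict) -> dict:
    """Keep included fields plus primitives on or above their level."""
    result = {}
    for key, value in data.items():
        rule = fields.get(key)
        if rule is True:
            result[key] = value  # explicitly included, with all subfields
        elif isinstance(rule, dict) and isinstance(value, dict):
            result[key] = apply_inclusions(value, rule)  # on the path to an included field
        elif not isinstance(value, (dict, list)):
            result[key] = value  # primitive: cannot be filtered out at this level
    return result


structure = {
    "a": {
        "b": 1,
        "c": {"d": "foo", "e": "bar", "f": {"g": "text", "h": 1}},
        "x": [1, 2, 3],
        "y": {"z": "hello", "w": "world"},
    }
}
# Only e is explicitly included.
filtered = apply_inclusions(structure, {"a": {"c": {"e": True}}})
print(filtered)
```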

Report Views

Views are transformations of the JSON analysis output produced by the Worker. For example, views can be used to change the names of some sections in the analysis report. There are also deprecated views that allow filtering fields in or out, but this functionality is covered by report types (see above). The following views are present by default (deprecated views are excluded):

classification_top_container_only

Returns a report view equivalent to the classification report type (see above), but for the top-level container (parent file).

flat

“Flattens” the JSON structure.

Without flattening:

"tc_report": [
   {
       "info": {
           "file": {
               "file_type": "Binary",
               "file_subtype": "Archive",
               "file_name": "archive.zip",
               "file_path": "archive.zip",
               "size": 20324816,
               "entropy": 7.9999789976332245,

With flattening:

"tc_report": [
   {
       "info_file_entropy": 7.9999789976,
       "info_file_file_name": "archive.zip",
       "info_file_file_path": "archive.zip",
       "info_file_file_subtype": "Archive",
       "info_file_file_type": "Binary",

flat-one

Returns the flat report, but only for the parent file.
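The key naming used by the flat and flat-one views can be illustrated with a minimal Python sketch that joins nested keys with underscores. It is an assumption-laden approximation: how the real view treats arrays, key ordering, and number precision is not described here, so the sketch leaves non-object values untouched. The function name flatten is hypothetical.

```python
def flatten(section: dict, prefix: str = "") -> dict:
    """Collapse nested objects into one level, joining key names with '_'."""
    flat = {}
    for key, value in section.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value  # primitives and arrays are copied unchanged
    return flat


nested = {
    "info": {
        "file": {
            "file_type": "Binary",
            "file_subtype": "Archive",
            "file_name": "archive.zip",
        }
    }
}
flat = flatten(nested)
print(flat)
```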

no_goodware

Returns a short version of the report for the top-level container, and any children files that are suspicious or malicious (goodware files are filtered out). This view is not compatible with split reports.

no_email_indicator_reasons

Strips potential PII (personally identifiable information) from some fields in analysis reports for email messages, and replaces it with a placeholder string.

splunk-mod-v1

Transforms the report so that it’s better suited for indexing by Splunk. The changes are as follows:

  • if classification is 0 or 1, factor becomes confidence

  • if classification is 2 or 3, factor becomes severity

  • a string_status field is added with the overall classification (UNKNOWN, GOODWARE, SUSPICIOUS, MALICIOUS)

  • scanner name becomes reason

  • scanner result becomes threat
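The renaming rules listed above can be sketched in Python as follows. This is only an illustration of the listed changes, not the view's real implementation. It assumes that string_status is added to the classification section and that the scanner renames apply to entries in scan_results; neither placement is confirmed by this document. The function name splunk_mod and the scanner name example_scanner are hypothetical.

```python
STATUS = ("UNKNOWN", "GOODWARE", "SUSPICIOUS", "MALICIOUS")


def splunk_mod(classification: dict) -> dict:
    """Apply the splunk-mod-v1 style renames to a classification section."""
    out = {
        key: value
        for key, value in classification.items()
        if key not in ("factor", "scan_results")
    }
    level = classification["classification"]
    if "factor" in classification:
        # factor becomes confidence for 0/1 and severity for 2/3
        new_name = "confidence" if level in (0, 1) else "severity"
        out[new_name] = classification["factor"]
    out["string_status"] = STATUS[level]
    scanners = []
    for scanner in classification.get("scan_results", []):
        renamed = dict(scanner)
        if "name" in renamed:
            renamed["reason"] = renamed.pop("name")
        if "result" in renamed:
            renamed["threat"] = renamed.pop("result")
        scanners.append(renamed)
    out["scan_results"] = scanners
    return out


converted = splunk_mod(
    {
        "classification": 3,
        "factor": 5,
        "scan_results": [
            {"name": "example_scanner", "result": "Win32.Ransomware.GandCrab"}
        ],
    }
)
print(converted)
```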

Views can generally be applied to both split (available in self-hosted deployments) and non-split reports. If none of these views satisfy your use case, contact ReversingLabs Support to get help with building a new custom view.

Interpreting the report

After sending files to be processed, users receive a link to a JSON report (see Response format in Worker - Get information about a processing task). It contains a tc_report field, which looks something like this:

"tc_report": [
   {
      "info": {
         "file": {
            "file_type": "Text",
            "file_subtype": "Shell",
            "file_name": "test.sh",
            "file_path": "test.sh",
            "size": 35,
            "entropy": 3.7287244452691413,
            "hashes": []
         }
      },
      "classification": {
         "propagated": false,
         "classification": 0,
         "factor": 0,
         "scan_results": [
            {}
         ],
         "rca_factor": 0
      }
   }
]

High-level overview

The key information here is the classification value, which will be a number from 0 to 3:

  • 0: unknown (no threats found)

  • 1: goodware

  • 2: suspicious

  • 3: malicious
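A minimal Python sketch of reading the classification value from a report and turning it into a label. The JSON string is a shortened stand-in for a real report, and the function name classification_labels is hypothetical.

```python
import json

LABELS = {0: "unknown", 1: "goodware", 2: "suspicious", 3: "malicious"}


def classification_labels(tc_report: list) -> list:
    """Return one human-readable label per entry in tc_report."""
    return [LABELS[entry["classification"]["classification"]] for entry in tc_report]


# Shortened example report, for illustration only.
report = json.loads(
    '{"tc_report": [{"classification": {"classification": 0, "factor": 0}}]}'
)
labels = classification_labels(report["tc_report"])
print(labels)
```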

More information

For more information, use the rca_factor. The higher its value, the more dangerous the threat, except for files that weren’t classified (their classification is 0). In that case, rca_factor will be 0 and will not signal trustworthiness.

rca_factor: common reasons for classification

  • 0: File comes from a very trustworthy domain or has a very trustworthy certificate. Examples: HP, IBM, Microsoft, Oracle, Intel, Dell, Sony, Google…

  • 1: File comes from a trustworthy domain or has a trustworthy certificate. Examples: php.net, mit.edu, postgresql.org, redhat.de, opera.com, nasa.gov…

  • 2: File comes from a usually trusted domain. Examples: softpedia.com, sourceforge.net, cnet.com…

  • 3: File comes from another known site.

  • 4: Some valid but not very trusted certificates.

  • 5: Low trust source, no whitelisted certificates.

  • 6: Adware, potentially unwanted apps, tools for masking malware (packers).

  • 7: Spyware.

  • 8: Tools used to introduce malware or to use infected machines for denial-of-service attacks.

  • 9: Malicious browser extensions, fake antivirus software, rootkits.

  • 10: Virus, worm, trojan, keylogger, infostealer. Most dangerous threats.

Detailed inspection

For even more information on why a file was given a certain classification, look at the scan_results. This field contains the individual scanners which processed the file (name), as well as their reason for classifying a file a given way (result).
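The scanner details can be pulled out of a report entry with a few lines of Python. This sketch assumes only the name and result fields mentioned above and skips empty scanner entries; the function name scanner_reasons and the sample values are hypothetical.

```python
def scanner_reasons(entry: dict) -> list:
    """Return (scanner name, result) pairs for one tc_report entry."""
    reasons = []
    for scanner in entry["classification"].get("scan_results", []):
        if "name" in scanner:  # skip empty placeholder entries such as {}
            reasons.append((scanner["name"], scanner.get("result")))
    return reasons


# Hypothetical entry for illustration only.
entry = {
    "classification": {
        "classification": 3,
        "scan_results": [
            {},
            {"name": "example_scanner", "result": "Win32.Ransomware.Heuristic"},
        ],
    }
}
reasons = scanner_reasons(entry)
print(reasons)
```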

Spectra Core

Spectra Detect Worker relies on the built-in Spectra Core static analysis engine to classify files and produce the analysis report.

The file classification system can produce the following classification states: goodware, suspicious, malicious. With this classification system, any file found to contain a malicious file will be considered malicious itself if classification propagation has been enabled in the configuration. In the default configuration, propagation is enabled.

Multiple technologies are used for file classification, such as: format identification (malware packers), signatures (byte pattern matches), file structure validation (format exploits), extracted file hierarchy, file similarity (RHA1), certificates, machine learning (for Windows executables), heuristics (scripts and fileless malware) and YARA rules. These are shipped with the static analysis engine, and their coverage varies based on threat and file format type.

Classifying Files with Cloud-Enabled Spectra Core

Spectra Core can be connected to Spectra Intelligence to use file reputation data. This data is not based solely on antivirus scanning results, but on the interpretation of the accuracy of those results by ReversingLabs, as well as on the analyst-provided (manual) classification overrides. Note that the only information Spectra Core submits to the Spectra Intelligence cloud is the file hash.

Connecting Spectra Core to the cloud will add threat reputation to the scan_results in the report, for example:

{
   "ignored": false,
   "type": "av",
   "classification": 0,
   "factor": 0,
   "name": "mcafee_online",
   "rca_factor": 0
}

To connect Spectra Core to Spectra Intelligence, in the Manager, go to Central Configuration > Spectra Intelligence.

How Spectra Intelligence Enhances Spectra Core Classification

Threat Naming Accuracy

When classifying files, Spectra Core takes all engines listed in the analysis report (Spectra Intelligence included) into consideration. Based on their responses, it selects the technology that provides the most accurate threat naming. More specific methods that identify the malware family more accurately are given precedence. Generic and heuristic classification are picked last, and only if there is no better-named classification response.

Spectra Intelligence generally returns specific threat names, and it will be selected as authoritative if a better option is not available. It can also enhance Spectra Core classification results. For instance, Spectra Core machine learning can classify malware only with heuristic, non-named classification. If Spectra Core finds ransomware via machine learning, the threat name will appear as Win32.Ransomware.Heuristic.

However, if Spectra Core is connected to Spectra Intelligence, the cloud response can change the threat name to something better-defined, such as Win32.Ransomware.GandCrab. This helps users understand exactly which malware family they are dealing with, as opposed to just the threat type (ransomware).

Whitelisting and Goodware Overrides

When not connected to Spectra Intelligence, Spectra Core can classify files as goodware either based on digital certificates that were used to sign the files, or via graylisting - a system that can declare certain file types as goodware based on the lack of code detected within them.

When connected to Spectra Intelligence, whitelisting can be expanded based on file reputation and origin information. As a result, the number of unknown files (files without classification) will be significantly reduced. Users also get an insight into the trustworthiness of whitelisted files measured through trust factor values.

If classification propagation is enabled on Spectra Core, a whitelisted file can still be classified as suspicious or malicious if any of its extracted files is classified as suspicious or malicious.

Goodware overrides is a feature designed to prevent this. When enabled, it ensures that any files extracted from a parent and whitelisted by certificate, source, or user override can no longer be classified as malicious or suspicious. Spectra Core automatically calculates the trust factor of the certificates before applying goodware overrides, and does not use certificates with the trust factor lower than the user-configured goodware override factor.

With goodware overrides enabled, classification is suppressed on any files extracted from whitelisted containers. In this case, whitelisting done by either Spectra Intelligence or certificates will be the final classification outcome. Spectra Core will still report all malicious files it finds, but they won’t be displayed as the final file classification.

This feature allows for more advanced heuristic classifications that have a chance of catching supply chain attacks. As those rules tend to be noisy, they can be suppressed by using this feature. The user can still see all engine classification results, and can use them to proactively hunt for possible supply chain attacks.

Note that the goodware overrides will not apply if the extracted file is found by itself (for example, if an extracted file is rescanned without its container).

Mapping Spectra Intelligence and Spectra Core Classification States

When Spectra Core is connected to Spectra Intelligence, it uses a combination of two classification systems - Spectra Core file classification and Malware Presence (MWP). While their malicious and suspicious classification states map cleanly onto one another, the MWP known state does not map directly onto Spectra Core goodware.

By default, Spectra Core converts all MWP known states to goodware. This can be problematic when false negative scans happen in the Spectra Intelligence cloud, as the cloud would declare a file as KNOWN, but in reality, the file would be undiscovered malware. Such files generally have a low trust factor value.

To resolve those issues, users can rescan files classified as non-malicious to confirm whether they are false negatives, or configure Spectra Core to map Spectra Intelligence known to goodware based on the trust factor.

Users can configure the MWP Goodware Factor value, which defines the threshold for automatically converting MWP known to Spectra Core goodware. When this value is configured, instead of converting known to goodware in all cases, Spectra Core will only convert it when a file has a trust factor value equal to or lower than the configured one.

By default, the value is configured to 5 (convert all), and the recommended value is 2. In this case, if a file classified as known in the cloud has a trust factor greater than 2, Spectra Core will not consider that file as goodware. It will be considered unknown (not classified), and its cloud classification will not be present in the list of scanners in the Spectra Core analysis report.
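The threshold rule can be expressed as a one-line comparison. The following Python sketch only restates the rule described above; the function name map_mwp_known is hypothetical and is not a configuration option or API.

```python
def map_mwp_known(trust_factor: int, goodware_factor: int = 5) -> str:
    """Map an MWP known result to a Spectra Core state using the threshold."""
    # known becomes goodware only at or below the configured factor
    return "goodware" if trust_factor <= goodware_factor else "unknown"


print(map_mwp_known(3))     # default threshold of 5
print(map_mwp_known(3, 2))  # recommended threshold of 2
```

With the default value of 5, a known file with trust factor 3 is treated as goodware. With the recommended value of 2, the same file is treated as unknown.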

Spectra Detect Decision Matrix

The following section explains how to interpret the classification data provided by Spectra Detect. The intent is to maximize the effectiveness of malicious classifications, while reducing the negative impact false positive detections might have.

Start the decision-making process by looking at the top-level file, found first in the Spectra Detect report. Perform the following checks in order.

Unknown classification

  1. If the value of tc_report.classification.classification is 0 (unknown, no threats found), Spectra Detect detected no threats and the file is not present in ReversingLabs Cloud (or in the T1000 database).

    The sample could receive a classification at a later date, or it can be uploaded for analysis to the Spectra Intelligence cloud.

    After cloud classification, the sample may be marked as 3 (Malicious), 2 (Suspicious) or 1 (Known/Goodware). ReversingLabs reserves the “Known/Goodware” classification only for samples that the classification algorithm deems as trustworthy. If a sample remains “Unknown”, analysis did not find malicious intent at that time, but the sample and its metadata are not trustworthy enough to be declared “Known/Goodware.”

Known classification

  1. If the value of tc_report.classification.classification is 1 (known), the file has been analyzed and the analysis did not find any known threats. To determine if the file is goodware, trusted and clean, perform the following checks:

    1. If the value of tc_report.classification.factor is 0 or 1, the file is goodware and it comes from a highly reputable source.

    2. If the value of tc_report.classification.factor is 2 or 3, the file is clean and it comes from known public sources usually unrelated to malware infections.

    3. If the value of tc_report.classification.factor is 4 or 5, the file is likely clean and it comes from known public sources, some of which have been known to carry malware in the past. Files with this factor can change to other classifications over time, or their factor can improve when they are found in better sources.

Suspicious classification

  1. If the value of tc_report.classification.classification is 2 (suspicious), the file has been analyzed and the analysis found a possibility that the file is a new type of threat. This classification category is reserved for static analysis and cloud reputation heuristics, and it can lead to false positives. Depending on the risk aversion profile, two approaches are advised:

    1. High risk tolerance - suspicious classifications are allowed, because heuristics often trigger on files. This most commonly happens when files are collected in a corrupted or truncated state before analysis.

    2. Low risk tolerance - suspicious classifications are not allowed. However, filtering on specific reasons for suspicious classifications is still available via the name of the first element in the tc_report.classification.scan_results list (tc_report.classification.scan_results[0].name).

Malicious classification

  1. If the value of tc_report.classification.classification is 3 (malicious), the file has been analyzed and recognized as a known malware threat. Depending on the risk aversion profile, two approaches are advised:

    1. High risk tolerance - malicious files are not allowed, but PUA (potentially unwanted applications) are. PUA are lower-risk malware and have tc_report.classification.factor set to 1. If PUA are allowed, an additional filter that blocks files with a factor greater than 1 is advised.

    2. Low risk tolerance - malicious files are not allowed regardless of classification reason.
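The checks above can be combined into a single triage function. The Python sketch below is one possible reading of the decision matrix, not a ReversingLabs-provided algorithm. The function name triage, the low_risk_tolerance flag, and the returned verdict strings are all hypothetical.

```python
def triage(classification: dict, low_risk_tolerance: bool = True) -> str:
    """Map a tc_report classification section to a coarse verdict."""
    level = classification["classification"]
    factor = classification.get("factor", 0)
    if level == 0:
        return "unknown"  # no threats found, but not trusted either
    if level == 1:
        if factor <= 1:
            return "goodware"  # highly reputable source
        return "clean" if factor <= 3 else "likely clean"
    if level == 2:
        return "block" if low_risk_tolerance else "allow"
    # level 3: only PUA (factor not greater than 1) passes,
    # and only with high risk tolerance
    if not low_risk_tolerance and factor <= 1:
        return "allow"
    return "block"


print(triage({"classification": 1, "factor": 4}))
print(triage({"classification": 3, "factor": 1}, low_risk_tolerance=False))
```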

Classification Propagation

Spectra Detect unpacks files during analysis, so it is possible to have a file that is classified based on its content. Any file that contains a malicious or suspicious file is also considered malicious or suspicious because of its content.
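Propagation can be pictured as a walk over the extracted file hierarchy. The Python sketch below is a simplified illustration only. It assumes a hypothetical children list on each file and assumes that the more severe of suspicious (2) and malicious (3) wins; it ignores factors, goodware overrides, and other engine details.

```python
def propagate(file_node: dict) -> int:
    """Return a file's classification after considering its extracted files."""
    level = file_node["classification"]
    for child in file_node.get("children", []):
        child_level = propagate(child)
        if child_level in (2, 3):  # only suspicious/malicious propagate upward
            level = max(level, child_level)
    return level


# Hypothetical archive: goodware container holding one malicious file.
archive = {
    "classification": 1,
    "children": [{"classification": 0}, {"classification": 3}],
}
print(propagate(archive))
```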

Propagated Classification Suppression

In some cases, unpacked files might contain files that are misclassified. These false positives are propagated to the top, and it may appear that the entire archive is malicious.

Suppression of these classifications is possible, and is safe, under the following conditions:

  1. The classification is caused by propagation. When this happens, the optional field tc_report.classification.propagation_source exists.

  2. One or both of the following:

    • The top-level file has been found in a trusted source. Find the scanner named av in the tc_report.classification.scan_results scanner list. If it exists, and its classification is 1 (known), check its factor value. If the factor value is 0 or 1, the file is goodware and it comes from a highly reputable source.

    • The top-level file is signed by a trusted party. Find the scanner named “TitaniumCore Certificate Lists” in the tc_report.classification.scan_results scanner list. If it exists and its classification is 1 (known), the file is goodware regardless of its factor value.

Propagation Suppression Example

The following false positive scenario illustrates how the suppression logic is applied:

  1. A file is considered whitelisted because it is signed by a trusted digital certificate.

  2. It has been classified as known and highly trusted in ReversingLabs Cloud, and has no positive antivirus detections in ReversingLabs Cloud.

  3. However, during extraction, one or more malicious files were detected inside the file, and one or more malicious detections have been declared as a false positive.

The classification of the file that has been submitted to Spectra Detect is given in the top-level section. The presence of the propagation_source field indicates that the top-level file is considered malicious because it contains at least one malicious file. The SHA1 value points to that contained file, which is the origin of the final top-level classification.

The scanners in the scan_results list enumerate all the factors that contributed to the final classification. For example, they might include:

  1. Resulting classification from propagation: Classification is malicious ( classification: 3 ) with factor: 2

  2. Certificate whitelisting for the top-level file: Classification is goodware ( classification: 1 ) with factor: 0

  3. Cloud response for the top-level file: Classification is goodware ( classification: 1 ) with factor: 0

The suppression algorithm can be applied to this top-level file, as it is not only signed with a whitelisted certificate, but is also considered known and highly trusted in ReversingLabs Cloud.
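The suppression conditions can be checked mechanically. The following Python sketch follows the conditions as worded above and should be read as an illustration, not as the product's algorithm. The function name can_suppress is hypothetical, and the propagation scanner name and the propagation_source contents in the sample are placeholders.

```python
CERT_SCANNER = "TitaniumCore Certificate Lists"


def can_suppress(classification: dict) -> bool:
    """Check whether a propagated classification is safe to suppress."""
    if "propagation_source" not in classification:
        return False  # classification was not caused by propagation
    for scanner in classification.get("scan_results", []):
        if scanner.get("classification") != 1:
            continue  # only known/goodware verdicts count
        if scanner.get("name") == CERT_SCANNER:
            return True  # signed by a trusted party, any factor
        if scanner.get("name") == "av" and scanner.get("factor") in (0, 1):
            return True  # found in a highly reputable source
    return False


# Sample mirroring the three scanners from the example above.
top_level = {
    "classification": 3,
    "factor": 2,
    "propagation_source": {"sha1": "<sha1 of the malicious extracted file>"},
    "scan_results": [
        {"name": "<propagation scanner>", "classification": 3, "factor": 2},
        {"name": CERT_SCANNER, "classification": 1, "factor": 0},
        {"name": "av", "classification": 1, "factor": 0},
    ],
}
print(can_suppress(top_level))
```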