Why Should I Use CIM Datamodels?
Using datamodels simplifies searching and can dramatically improve performance. This post is a breakdown of how this works, to answer the common question, “why should I use data models in Splunk?”
Data Models organize data categorically
This may be easiest to demonstrate with network data, because these data sources are probably the most similar between different vendors. Traffic logs from Cisco Firewalls, Juniper Switches, and AWS VPC Flow Logs generally contain very similar information. These logs can be normalized in Splunk with consistent field names and accessed together through the Network_Traffic datamodel.
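As a sketch of what that normalization looks like in practice, the following search pulls events from the All_Traffic dataset and displays them with their CIM field names, regardless of which vendor produced them:

```spl
| datamodel Network_Traffic All_Traffic search
| table sourcetype All_Traffic.src All_Traffic.dest All_Traffic.dest_port All_Traffic.action
```

A Cisco event and a VPC Flow Log event will appear side by side here, each with its source and destination mapped to the same `src` and `dest` fields.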
Beyond network data there are some interesting advantages to organizing data this way. These advantages boil down to what kinds of questions are being asked. Security and Operations questions can often be broken into particular domains – for example, network, process, file system, configuration, performance, and authentication.
Data from completely different products can almost always be tied back to these domains.
You can ask questions about your environment in general
Network data can be used as another example here. A common use case is to compare all network traffic against a list of known malicious IP addresses and report on matches. This would be complicated if we're accounting for every source of network data, across separate networks, business units, and subsidiaries, and between multiple cloud accounts which could be in separate indexes. In addition to the challenge of writing the search, there's a fundamental performance problem: an enormous amount of raw logs would need to be scanned, so the search may not be practical to run in the first place.
Writing this kind of search against the Network_Traffic datamodel is trivial in comparison. Data is still separated by index in the underlying filesystem, and access control restrictions still apply, but index filters are not needed in the search. Additionally, the Network_Traffic datamodel is accelerated by default in Enterprise Security, so the performance issues mentioned above are less of a concern. (More on acceleration in a bit.) A search can pull all allowed network traffic, compare it to a lookup containing known malicious IPs, and filter for matches.
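A minimal sketch of that search might look like the following, assuming a hypothetical lookup named `malicious_ips` with an `ip` field:

```spl
| tstats summariesonly=true count from datamodel=Network_Traffic
    where All_Traffic.action="allowed"
    by All_Traffic.src All_Traffic.dest
| rename All_Traffic.* as *
| lookup malicious_ips ip AS dest OUTPUT ip AS matched_ip
| where isnotnull(matched_ip)
```

Note there is no index filter anywhere in the search: the datamodel handles scoping, and `summariesonly=true` restricts the search to accelerated summaries for speed.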
The same approach can be used with all of the technology domains mentioned in the previous section. Other common use cases involve looking for known malicious processes across the environment based on file hashes and file names, and analyzing authentication times and locations for users across all endpoints.
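As a sketch, a hash-based hunt against the CIM Endpoint datamodel follows the same shape as the network example (`known_bad_hashes` is a hypothetical lookup with `hash` and `description` fields):

```spl
| tstats summariesonly=true count from datamodel=Endpoint.Processes
    by Processes.dest Processes.process_name Processes.process_hash
| rename Processes.* as *
| lookup known_bad_hashes hash AS process_hash OUTPUT description
| where isnotnull(description)
```

Only the datamodel and field names change between domains; the overall pattern of the search stays the same.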
Acceleration lets you get more out of the same Splunk environment
Data Model Acceleration allows for a dramatic improvement in search performance. The network traffic example above is a dense search, and if performed on raw logs would require millions of events to be scanned. An accelerated search looking at the same information may still be expensive. However, the search processor has much less information to iterate over as the data is stored in a format that is well-optimized for this kind of search.
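To make the contrast concrete, a raw-event version of the dense search might look like this (sourcetype names here are illustrative):

```spl
index=firewall sourcetype=cisco:asa OR sourcetype=pan:traffic action=allowed
| stats count by src_ip dest_ip
```

while the accelerated equivalent reads from pre-built summaries instead of scanning raw events:

```spl
| tstats summariesonly=true count from datamodel=Network_Traffic
    where All_Traffic.action="allowed"
    by All_Traffic.src All_Traffic.dest
```

Both produce the same kind of aggregate, but the second iterates over an indexed, columnar summary rather than millions of raw log lines.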
Assets table enrichment can simplify searches
In an ES environment, Assets and Identities fields are included in the datamodels. One of the easiest ways to simplify whitelisting or targeting is to use Assets and Identities categories within searches against the datamodels. This way categorization can be managed centrally, and searches can be filtered in a way that's easy to write and understand. These category fields are also included in acceleration, so filtering on them can speed up searches in accelerated datamodels as well.
For example, searches that alert on DNS issues may need to filter out domain controllers. Instead of adding each DC’s hostname or IP as a filter, use a category field to exclude the “domain_controller” category. The assets table can be manually or automatically updated to add the “domain_controller” value to the category field for each DC.
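A sketch of that exclusion, assuming the CIM Network_Resolution datamodel and ES asset enrichment populating the `src_category` field:

```spl
| tstats summariesonly=true count from datamodel=Network_Resolution
    where NOT DNS.src_category="domain_controller"
    by DNS.src DNS.query
```

When a new domain controller is added to the assets table, it is excluded from this search automatically, with no change to the SPL.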