Cloud Blog: Storage Insights datasets: How to optimize storage spend with deep visibility

Source URL: https://cloud.google.com/blog/products/storage-data-transfer/storage-insights-datasets-optimizes-storage-footprint/
Source: Cloud Blog
Title: Storage Insights datasets: How to optimize storage spend with deep visibility

Feedly Summary: Managing vast amounts of data in cloud storage can be a challenge. While Google Cloud Storage offers strong scalability and durability, storage admins sometimes struggle with questions like:

What’s driving my storage spend? 

Where is all my data in Cloud Storage and how is it distributed?

How can I search across my data for specific metadata such as age or size?

Indeed, to achieve cost optimization, security, and compliance, you need to understand what you have, where it is, and how it’s being used. That’s where Storage Insights datasets, a feature of Storage Intelligence for Cloud Storage, comes in. Storage Intelligence is a unified management product that offers multiple powerful capabilities to analyze large storage estates and easily take action. It helps you explore your data, optimize costs, enforce security, and implement governance policies. Storage Insights datasets help you deeply analyze your storage footprint, and you can use Gemini Cloud Assist for quick analysis in natural language. Based on these analyses, you can take action, such as relocating buckets and performing large-scale batch operations.
In this blog, we focus on how you can use Insights datasets for cost management and visibility, exploring a variety of common use cases. This is especially useful for cloud administrators and FinOps teams performing cloud cost allocation, monitoring, and forecasting.
What are Storage Insights datasets?
Storage Insights datasets provide a powerful, automated way to gain deep visibility into your Cloud Storage data. Instead of manual scripts, custom one-off reports for buckets or managing your own collection pipelines, Storage Insights datasets generate comprehensive reports about your Cloud Storage objects and their activities, placing them directly in a BigQuery linked dataset.
Think of it as X-ray vision for your Cloud Storage buckets. It transforms raw storage metadata into structured, queryable data that you can analyze with familiar BigQuery tools to gain crucial insights, with automatic data refreshes delivered every 24 hours (after the initial setup, which can take up to 48 hours for the first load).
Key features

Customizable scope: Set the dataset scope to be at the level of the organization, a folder containing projects, a project or set of projects, or a specific bucket.

Metadata dataset: It provides a queryable dataset that contains bucket and object metadata directly in BigQuery.

Regular updates and retention: After the first load, datasets update with metadata every 24 hours and can retain data for up to 90 days.


Use cases
Calculate routine showback
Understanding who, or which applications, are consuming what storage is often the first step in effective cost management, especially for larger organizations. With Storage Insights datasets, your object and bucket metadata is available in BigQuery. You can run SQL queries to aggregate storage consumption by specific teams, projects, or applications. You can then attribute storage consumption by buckets or prefixes for internal chargeback or cost attribution, for example: “Department X used 50TB of storage in gs://my-app-data/department-x/ last month”. This transparency fosters accountability and enables accurate internal showback.
Here’s an example SQL query to determine the total storage per bucket and prefix in the dataset:

SELECT
  bucket,
  SPLIT(name, '/')[OFFSET(0)] AS top_level_prefix,
  SUM(size) AS total_size_bytes
FROM
  object_attributes_view
GROUP BY
  bucket, top_level_prefix
ORDER BY
  total_size_bytes DESC;
-- Note: Running queries on datasets accrues BigQuery query costs; refer to the pricing details page for more information.

Understand how much data you have across storage classes
Storage Insights datasets identify the storage class for every object in your buckets. By querying the storageClass, timeCreated, and updated fields in the object metadata view in BigQuery, you can quickly visualize how your data is distributed across the various classes (Standard, Nearline, Coldline, Archive) for objects beyond a certain age, as well as when they were last updated. This lets you identify potentially misclassified data. It also provides valuable insight into whether you have entire buckets of Coldline or Archive data, or whether objects have unexpectedly moved across storage classes (for example, a file expected to be in Archive is now in Standard), using the timeStorageClassUpdated object metadata.
Here’s an example SQL query to see all objects created two years ago that have not been updated since and are still in the Standard class:

SELECT
  bucket,
  name,
  size,
  storageClass,
  timeCreated,
  updated
FROM
  object_attributes_latest_snapshot_view
WHERE
  EXTRACT(YEAR FROM timeCreated) = EXTRACT(YEAR FROM DATE_SUB(CURRENT_DATE(), INTERVAL 24 MONTH))
  AND (updated IS NULL OR updated = timeCreated)
  AND storageClass = 'STANDARD'
ORDER BY
  timeCreated;
-- Note: Running queries on datasets accrues BigQuery query costs; refer to the pricing details page for more information.

Set lifecycle and Autoclass policies: Automating your savings
Manual data management is time-consuming and prone to error. Storage Insights datasets help you identify where Object Lifecycle Management (OLM) or Autoclass might reduce costs.

Locate the buckets that don’t have OLM or Autoclass configured: Through Storage Insights datasets, you can query bucket metadata to see which buckets lack defined lifecycle policies by using the lifecycle and autoclass.enabled fields. If a bucket contains data that should naturally transition to colder storage or be deleted after a certain period, but has no policy, knowing which parts of your estate to investigate further lets you take the appropriate action. Storage Insights datasets provide the data to flag these "unmanaged" buckets, helping you enforce best practices.

Here’s an example SQL query to see all buckets with lifecycle or Autoclass configurations enabled, and all those without any active configuration:

SELECT
  name AS bucket_name,
  storageClass AS default_class,
  CASE
    WHEN lifecycle = TRUE OR autoclass.enabled = TRUE THEN 'Managed'
    ELSE 'Unmanaged'
  END AS lifecycle_autoclass_status
FROM
  bucket_attributes_latest_snapshot_view;
-- Note: Running queries on datasets accrues BigQuery query costs; refer to the pricing details page for more information.

Evaluate Autoclass impact: Autoclass automatically transitions objects between storage classes based on how recently they have been accessed. But how do you know whether it’s working as expected or whether further optimization is needed? With Storage Insights datasets, you can find the buckets with Autoclass enabled using the autoclass.enabled field, then analyze object metadata by tracking the storageClass and timeStorageClassUpdated fields over time for specific objects within those buckets. This allows you to evaluate the effectiveness of Autoclass, verify that objects are indeed moving to optimal classes, and understand the real-world impact on your costs. For example, once you configure Autoclass on a bucket, you can visualize how your data has moved between storage classes on Day 31 compared to Day 1 and understand how Autoclass takes effect on your bucket.
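For instance, here is a minimal sketch of such an analysis, using only the views and fields referenced above, that groups objects in Autoclass-enabled buckets by their current storage class; comparing the results across days shows how the distribution shifts:

-- Distribution of objects by current storage class in Autoclass-enabled buckets.
SELECT
  o.bucket,
  o.storageClass,
  COUNT(*) AS object_count,
  SUM(o.size) AS total_size_bytes
FROM
  object_attributes_latest_snapshot_view AS o
JOIN
  bucket_attributes_latest_snapshot_view AS b
ON
  o.bucket = b.name
WHERE
  b.autoclass.enabled = TRUE
GROUP BY
  o.bucket, o.storageClass
ORDER BY
  o.bucket, total_size_bytes DESC;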

Evaluate Autoclass suitability: Analyze your bucket’s data to determine whether Autoclass is appropriate for it. For example, if a bucket holds mostly short-lived data (less than 30 days old), you may not want to turn on Autoclass. You can assess objects in daily snapshots to determine the average lifetime of an object in your bucket using timeCreated and timeDeleted.

Here’s an example SQL query to count the objects in bucketA and bucketB whose lifetime (from creation to deletion) was more than 30 days versus less than 30 days:

SELECT
  SUM(
    CASE
      WHEN TIMESTAMP_DIFF(t1.timeDeleted, t1.timeCreated, DAY) < 30 THEN 1
      ELSE 0
    END
  ) AS age_less_than_30_days,
  SUM(
    CASE
      WHEN TIMESTAMP_DIFF(t1.timeDeleted, t1.timeCreated, DAY) > 30 THEN 1
      ELSE 0
    END
  ) AS age_more_than_30_days
FROM
  `object_attributes_view` AS t1
WHERE
  t1.bucket IN ('bucketA', 'bucketB')
  AND t1.timeCreated IS NOT NULL
  AND t1.timeDeleted IS NOT NULL;
-- Note: Running queries on datasets accrues BigQuery query costs; refer to the pricing details page for more information.

Proactive cleanup and optimization
Beyond routine management, Storage Insights datasets can help you proactively find and eliminate wasted storage.

Quickly find duplicate objects: Accidental duplicates are a common cause of wasted storage. You can use object metadata like size, name or even crc32c checksums in your BigQuery queries to identify potential duplicates. For example, finding multiple objects with the exact same size, checksum and similar names might indicate redundancy, prompting further investigation.

Here’s an example SQL query to list all objects whose size, crc32c checksum, and name all have the same values (indicating potential duplicates):

SELECT
  name,
  bucket,
  timeCreated,
  crc32c,
  size
FROM (
  SELECT
    name,
    bucket,
    timeCreated,
    crc32c,
    size,
    COUNT(*) OVER (PARTITION BY name, size, crc32c) AS duplicate_count
  FROM
    `object_attributes_latest_snapshot_view`
)
WHERE
  duplicate_count > 1
ORDER BY
  size DESC;
-- Note: Running queries on datasets accrues BigQuery query costs; refer to the pricing details page for more information.

Find temporary objects to be cleaned up: Many applications generate temporary files that, if not deleted, accumulate over time. Storage Insights datasets allows you to query for objects matching specific naming conventions (e.g., *_temp, *.tmp), or located in "temp" prefixes, along with their creation dates. This enables you to systematically identify and clean up orphaned temporary data, freeing up valuable storage space.

Here’s an example SQL query to find all log files created at least a month ago:

SELECT
  name,
  bucket,
  timeCreated,
  size
FROM
  `object_attributes_latest_snapshot_view`
WHERE
  name LIKE '%.log'
  AND DATE(timeCreated) <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH)
ORDER BY
  size DESC;
-- Note: Running queries on datasets accrues BigQuery query costs; refer to the pricing details page for more information.

List all objects older than a certain date for easy actioning: Need to archive or delete all images older than five years for compliance? Or perhaps you need to clean up logs that are older than 90 days? Storage Insights datasets provide timeCreated and contentType for every object. A simple BigQuery query can list all objects older than your specified date, giving you a clear, actionable list of objects for further investigation. You can then use Storage Intelligence batch operations, which let you act on billions of objects in a serverless manner.
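As an illustration, a sketch along these lines (the image content type and five-year cutoff are hypothetical examples) could produce such a list:

-- Images older than five years, as a candidate list for archiving or deletion.
SELECT
  bucket,
  name,
  size,
  contentType,
  timeCreated
FROM
  object_attributes_latest_snapshot_view
WHERE
  contentType LIKE 'image/%'
  AND DATE(timeCreated) <= DATE_SUB(CURRENT_DATE(), INTERVAL 5 YEAR)
ORDER BY
  timeCreated;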
Check soft delete suitability: Find buckets that hold a large amount of soft-deleted data by querying for the presence of softDeleteTime, along with size, in the object metadata tables. In those buckets, the data appears to be short-lived, and you may want to investigate soft delete cost optimization opportunities.
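For example, here is a minimal sketch of such a query, summing the size of soft-deleted objects per bucket (assuming soft-deleted objects appear in the object metadata with softDeleteTime populated):

-- Buckets ranked by the volume of soft-deleted data.
SELECT
  bucket,
  COUNT(*) AS soft_deleted_object_count,
  SUM(size) AS soft_deleted_bytes
FROM
  object_attributes_view
WHERE
  softDeleteTime IS NOT NULL
GROUP BY
  bucket
ORDER BY
  soft_deleted_bytes DESC;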

Taking your analysis further
The true power of Storage Intelligence’s Insights datasets lies not just in the raw data they provide, but in the insights you can derive and the subsequent actions you can take. Once your Cloud Storage metadata is in BigQuery, the possibilities for advanced analysis and integration are vast.
For example, you can use Looker Studio, Google Cloud’s no-cost data visualization and dashboarding tool, to directly connect to your BigQuery Insights datasets, transforming complex queries into intuitive, interactive dashboards. Now you can:

Visualize cost trends: Create dashboards that show storage consumption by project, department, or storage class over time. This allows teams to easily track spending, identify spikes, and forecast future costs.

Track fast-growing buckets: Analyze the buckets with the most growth in the past week or month, and compare them against known projects for accurate cost attribution. Use Looker’s alerting capabilities to notify you when certain thresholds are met, such as a sudden increase in the total size of data in a bucket (a starter query for this is sketched after this list).

Set up custom charts for common analysis: For routine FinOps use cases (such as tracking buckets without OLM policies configured or objects past their retention expiration time), you can generate weekly reports to relevant teams for easy actioning.
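For the fast-growing-buckets analysis mentioned above, here is a rough sketch that approximates recent growth from the size of newly created objects per bucket (it ignores deletions and overwrites, so treat it as an estimate):

-- Approximate last week's growth per bucket, based only on newly created objects.
SELECT
  bucket,
  COUNT(*) AS new_object_count,
  SUM(size) AS new_bytes_last_7_days
FROM
  object_attributes_latest_snapshot_view
WHERE
  DATE(timeCreated) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY
  bucket
ORDER BY
  new_bytes_last_7_days DESC;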

You can also use our template here to connect to your dataset for quick analysis or you can create your own custom dashboard.

Get started
Configure Storage Intelligence and create your dataset to start analyzing your storage estate via a 30-day trial today. Please refer to our pricing documentation for cost details. 
Set up your dataset at a scope of your choosing and start analyzing your data:

Configure a set of Looker Studio dashboards based on team or departmental usage for monthly analysis by the central FinOps team.

Use BigQuery for ad-hoc analysis and to retrieve specific insights.

For a complete cost picture, you can integrate your Storage Insights dataset with your Google Cloud billing export to BigQuery. Your billing export provides granular details on all your Google Cloud service costs, including Cloud Storage.
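
For example, here is a minimal sketch of a billing-export query (the dataset and table names are placeholders for your own export table) that totals Cloud Storage cost per project for the current month:

-- Cloud Storage cost per project for the current month, from the billing export.
SELECT
  project.id AS project_id,
  SUM(cost) AS storage_cost
FROM
  -- Placeholder: replace with your own billing export dataset and table.
  `my_billing_dataset.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX`
WHERE
  service.description = 'Cloud Storage'
  AND usage_start_time >= TIMESTAMP(DATE_TRUNC(CURRENT_DATE(), MONTH))
GROUP BY
  project_id
ORDER BY
  storage_cost DESC;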

AI Summary and Description: Yes

**Summary:**
The text discusses how Google Cloud’s Storage Insights datasets enhance data management for cloud storage administrators. It emphasizes the importance of cost optimization, security, and compliance by providing detailed visibility into stored data. Key features of Storage Insights datasets include customizable dataset scopes, automated reporting, and actionable insights through SQL queries, serving particularly FinOps and cloud administration professionals.

**Detailed Description:**
The piece outlines the challenges faced by storage administrators in managing vast amounts of data within Google Cloud Storage, highlighting the necessity for understanding data distribution and costs. Storage Insights datasets, part of Storage Intelligence, offer comprehensive tools for these tasks. Here are the significant points discussed:

– **Challenges in Cloud Storage Management:**
– Understanding storage expenditure.
– Detecting data location and distribution.
– Searching for metadata, such as object age and size.

– **Features of Storage Insights Datasets:**
– **Customizable Scope:** Datasets can be organized at various levels (organization, project, folder, bucket).
– **Metadata Dataset:** Provides a queryable dataset with metadata directly linked to BigQuery.
– **Regular Updates:** Automatic metadata updates every 24 hours after the initial load.

– **Use Cases:**
– **Cost Management and Showback:** Enables visibility into which applications consume storage. SQL queries can be run for reporting and chargeback purposes.
– **Data Classification and Optimization:** Identifies storage classes and potential misclassifications to ensure optimal cost management. Users can visualize data distribution across storage classes.
– **Automating Lifecycle Policies:** Assists in identifying buckets lacking Object Lifecycle Management (OLM) or Autoclass configurations to enforce best practices.
– **Proactive Storage Management:** Helps identify orphaned temporary files and potential duplicates to eliminate wasted storage.

– **Advanced Analysis Capabilities:**
– Integrate with Looker Studio for visual analytics of storage data, helping monitor trends and track cost allocation effectively.
– Leverage dashboards for ongoing team assessments and quick ad-hoc analysis using BigQuery.

– **Getting Started:**
Users are encouraged to configure Storage Intelligence and use the insights gained to monitor and analyze their cloud storage for cost effectiveness and compliance.

Overall, the insights provided through this service significantly aid cloud administrators, particularly those involved in FinOps, by automating data analysis processes and enabling better compliance and cost-efficiency efforts in managing cloud storage resources.