Landing Zone: GCP Enterprise Foundation
Hierarchy
- Org → Folders → Projects → Resources
- Policies inherit downward
- Use folders per environment (dev/staging/prod)
Shared VPC
- Host project (network) + Service projects (workloads)
- Centralized Cloud NAT, FW rules, VPN/Interconnect
- Service projects get subnets via Shared VPC admin
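A minimal setup sketch, assuming placeholder project IDs HOST_PROJECT and SERVICE_PROJECT:
# Enable Shared VPC on the host project, then attach a service project
gcloud compute shared-vpc enable HOST_PROJECT
gcloud compute shared-vpc associated-projects add SERVICE_PROJECT \
  --host-project=HOST_PROJECT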
Org Policies
- constraints/compute.requireOsLogin
- constraints/iam.allowedPolicyMemberDomains (domain restricted sharing)
- constraints/compute.vmExternalIpAccess
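Enforcement sketch via the legacy resource-manager surface (ORG_ID is a placeholder; newer setups use gcloud org-policies set-policy):
# Enforce OS Login org-wide
gcloud resource-manager org-policies enable-enforce \
  constraints/compute.requireOsLogin --organization=ORG_ID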
VPC Service Controls
- Perimeter around managed services (GCS, BQ, etc.)
- Prevents data exfiltration
- Dry-run → Enforced
- Ingress/Egress rules for on-prem access
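A hedged perimeter sketch; POLICY_ID and PROJECT_NUMBER are placeholders, and real rollouts usually start in dry-run mode:
gcloud access-context-manager perimeters create prod_perimeter \
  --policy=POLICY_ID \
  --title="prod-perimeter" \
  --resources=projects/PROJECT_NUMBER \
  --restricted-services=storage.googleapis.com,bigquery.googleapis.com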
Checklist
- Org + Folders created
- Shared VPC host/service project setup
- Cloud Interconnect/VPN to on-prem
- Cloud DNS forwarding + on-prem resolver
- IAM custom roles at org/folder level
- VPC Service Controls perimeter
- Cloud Armor + FW rules
- Logging: aggregation sink to BQ + GCS
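For the logging checklist item, a possible aggregated sink (ORG_ID, PROJECT, and the dataset name are placeholders):
gcloud logging sinks create org-audit-sink \
  bigquery.googleapis.com/projects/PROJECT/datasets/audit_logs \
  --organization=ORG_ID --include-children \
  --log-filter='logName:"cloudaudit.googleapis.com"'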
Hybrid Networking: GCP ↔ On-Prem
Hub-and-Spoke Topology
- Hub VPC hosts Shared VPC, VPN/Interconnect
- Spoke VPCs peer with hub (no transitive peering!)
- Use Network Connectivity Center for large scale
Cloud Interconnect
- Dedicated: 10/100 Gbps, physical cross-connect
- Partner: via supported provider (Equinix, Megaport)
- VLAN attachments → Cloud Router → BGP
Cloud VPN
- HA VPN: two gateway interfaces, 99.99% SLA (two or four tunnels)
- Classic VPN: single interface, 99.9% SLA (deprecated; prefer HA VPN)
- BGP dynamic routing recommended
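An HA VPN sketch with placeholder names; it assumes an external-vpn-gateway resource PEER_GW already exists and shows only the first of the two tunnels:
gcloud compute vpn-gateways create ha-gw --network=hub-vpc --region=us-central1
gcloud compute routers create vpn-router --network=hub-vpc \
  --region=us-central1 --asn=64514
gcloud compute vpn-tunnels create tunnel-0 \
  --region=us-central1 --vpn-gateway=ha-gw --interface=0 \
  --peer-external-gateway=PEER_GW --peer-external-gateway-interface=0 \
  --router=vpn-router --ike-version=2 --shared-secret=SECRET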
Cloud Router & BGP
- Advertise VPC subnets via BGP
- Custom route advertisements
- Global dynamic routing for multi-region
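Custom advertisement sketch (router name and range are placeholders; the groups flag keeps subnet advertisement alongside the custom range):
gcloud compute routers update vpn-router --region=us-central1 \
  --advertisement-mode=CUSTOM \
  --set-advertisement-groups=ALL_SUBNETS \
  --set-advertisement-ranges=10.10.0.0/16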
Key Design Pattern
On-Prem ──[BGP]── Cloud Router ── Shared VPC Hub
                                    ├── Spoke VPC (Prod)
                                    ├── Spoke VPC (Dev)
                                    └── PSC to managed services
Private Service Connect (PSC)
Concept
- Producer publishes service via Service Attachment
- Consumer creates PSC endpoint (internal IP) in own VPC
- Traffic stays on Google network — never leaves it
vs VPC Peering
- PSC: consumer gets IP from its own range
- Peering: full bi-directional connectivity, no IP overlap allowed
- PSC: consumer/producer IPs can overlap!
Hands-on Commands
# Producer: create service attachment
gcloud compute service-attachments create SA_NAME \
  --region=us-central1 \
  --producer-forwarding-rule=FWD_RULE \
  --nat-subnets=NAT_SUBNET

# Consumer: create PSC endpoint
gcloud compute addresses create PSC_IP \
  --region=us-central1 \
  --subnet=consumer-subnet
gcloud compute forwarding-rules create FR_NAME \
  --region=us-central1 \
  --network=CONSUMER_VPC \
  --target-service-attachment=projects/PROD/regions/.../serviceAttachments/SA_NAME \
  --address=PSC_IP
Use Cases
- Access GCP managed services (GCS, Bigtable) privately
- Expose your internal API privately to consumers
- Multi-tenant SaaS on GCP
DNS w/ PSC
- Create private DNS zone with A record pointing to PSC IP
- Attach to consumer VPC
- On-prem: DNS forwarding zone → Cloud DNS → resolves PSC
Hybrid DNS: GCP ↔ On-Prem Resolution
Cloud DNS → On-Prem
- Create DNS forwarding zone in Cloud DNS
- Forward onprem.example.com → on-prem DNS server IP
- Uses VPN/Interconnect path (private forwarding zone)
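A possible forwarding-zone command; the zone name, VPC, and resolver IP are placeholders:
gcloud dns managed-zones create onprem-fwd \
  --dns-name=onprem.example.com. \
  --visibility=private --networks=hub-vpc \
  --forwarding-targets=192.168.1.10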
On-Prem → Cloud DNS
- On-prem DNS forwards *.gcp.example.com → Cloud DNS inbound endpoint
- Cloud DNS inbound server policy → bound to VPC
End-to-End Flow
On-Prem App ──→ On-Prem DNS
  └── .gcp.example.com?   ──→ Cloud DNS Inbound EP
        ├── gcp.internal?      ──→ resolves via Private Zone
        └── psc.example.com?   ──→ PSC private zone → PSC IP
Cloud DNS Inbound Setup
gcloud dns policies create POLICY \
  --networks=NETWORK \
  --enable-inbound-forwarding \
  --description="Inbound forwarding from on-prem"
DNS Peering
- Peer zones between projects without forwarding
- Use for Shared VPC service project resolution
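A possible peering-zone command; the target project/network values are placeholders:
gcloud dns managed-zones create peer-zone \
  --dns-name=gcp.example.com. \
  --visibility=private --networks=service-vpc \
  --target-project=HOST_PROJECT --target-network=hub-vpc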
IAM & Policy: Identity & Access Management
IAM Types
- Basic (formerly primitive): Owner/Editor/Viewer (do not use)
- Predefined: e.g. roles/storage.admin
- Custom: least privilege, scoped to permissions
IAM Conditions
- Attribute-based: resource type, IP range, time
- e.g. resource.matchTag, request.time
- Evaluated at access time
Deny Policies
- Explicit deny overrides any allow
- Org-level deny; managed with roles/iam.denyAdmin
- Use for break-glass scenarios
Best Practices
- Use groups, not individual users
- Custom roles at folder level, not project
- Audit with Cloud Asset Inventory
- Policy Analyzer: gcloud asset analyze-iam-policy
Key Terms
- Principal = who (user, group, SA, workspace domain)
- Role = collection of permissions
- Policy = bind principal + role [+ condition] to resource
- Resource Manager = org/folder/project hierarchy
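A binding sketch tying these terms together; the group and expiry date are made-up examples:
# Principal + role + condition bound to a project
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="group:devs@example.com" \
  --role="roles/storage.objectViewer" \
  --condition='expression=request.time < timestamp("2026-01-01T00:00:00Z"),title=temp-access'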
Network Security
Firewall Rules (VPC FW)
- Ingress + Egress, stateful
- Priority: 0-65535 (lower number = higher priority)
- Implied rules: deny all ingress, allow all egress (lowest priority)
- Network tags for targeted rules
- Service accounts as source/dest
Cloud Armor
- WAF + DDoS protection at LB edge
- Pre-configured rules: XSS, SQLi, L7 DDoS
- Custom rules: rate limiting, IP block, geo
- Can do bot management (reCAPTCHA)
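A sketch of a geo-block rule; the policy name and region code are placeholders:
gcloud compute security-policies create edge-policy
gcloud compute security-policies rules create 1000 \
  --security-policy=edge-policy \
  --expression="origin.region_code == 'CN'" \
  --action=deny-403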
VPC Firewall Rules
gcloud compute firewall-rules create allow-internal \
  --network=my-vpc \
  --priority=1000 \
  --direction=INGRESS \
  --action=ALLOW \
  --source-ranges=10.0.0.0/8 \
  --rules=tcp:0-65535,udp:0-65535
Private Google Access
- VMs without external IP access GCP APIs privately
- Enabled per subnet
- Routes via default internet gateway (Google-only)
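Enable it per subnet, e.g. (subnet name is a placeholder):
gcloud compute networks subnets update my-subnet \
  --region=us-central1 --enable-private-ip-google-access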
Security Layers (Defense in Depth)
- Edge: Cloud Armor (WAF) + Google Front End
- VPC: FW rules + Private Google Access + VPC SC
- Workload: IAM + Workload Identity + gVisor (Sandbox)
- Data: CMEK, CSEK, DLP
Kubernetes Architecture: GKE & Core Concepts
Control Plane Components
- API Server: front-end, auth, validation
- etcd: distributed key-value store
- Scheduler: assigns pods to nodes
- Controller Manager: watches state, reconciles
Worker Node Components
- kubelet: agent, communicates with API server
- kube-proxy: network rules, iptables/ipvs
- Container Runtime: containerd
GKE Specific
- Autopilot vs Standard (node management)
- Workload Identity: K8s SA → GCP IAM (no keys!)
- GKE Dataplane V2 (eBPF, Cilium-based)
- Node auto-upgrade, auto-repair
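For the Workload Identity item above, a binding sketch; GSA_NAME, KSA_NAME, and NAMESPACE are placeholders:
# Allow the K8s SA to impersonate the Google SA
gcloud iam service-accounts add-iam-policy-binding \
  GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"
kubectl annotate serviceaccount KSA_NAME -n NAMESPACE \
  iam.gke.io/gcp-service-account=GSA_NAME@PROJECT_ID.iam.gserviceaccount.com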
Workload Resources
- Deployment: stateless apps, rolling update
- StatefulSet: stable identities, sticky storage
- DaemonSet: one pod per node (logging, monitoring)
- Job/CronJob: batch processing
Networking Model
- Every pod gets a real IP (flat network)
- No NAT between pods (CNI: e.g. Calico, Cilium)
- Service types: ClusterIP, NodePort, LoadBalancer, ExternalName
- Ingress (L7) and Gateway API (L7/L4)
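A minimal ClusterIP Service sketch; the app name and ports are placeholders:
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: ClusterIP
  selector:
    app: my-app
  ports:
    - port: 80         # service port
      targetPort: 8080 # container port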
Helm: Kubernetes Package Manager
Core Concepts
- Chart: package (templates + values + deps)
- Release: running instance of a chart
- Repository: chart storage (OCI, HTTP)
- Values: values.yaml, overridable with --set or -f
Helm 3 Changes
- No Tiller (pure client-side)
- 3-way strategic merge for upgrades
- Release info stored as Secrets
- OCI-based chart registry support
Common Commands
helm create my-chart # scaffold
helm lint ./my-chart # validate
helm template . --values prod.yaml # render locally
helm install my-release ./my-chart -f prod.yaml
helm upgrade --install my-release ./my-chart -f prod.yaml
helm rollback my-release 2 # rollback to revision 2
helm list -A # list all releases
Chart Structure
my-chart/
  Chart.yaml    # metadata
  values.yaml   # default values
  templates/    # Go templates
  charts/       # subcharts (deps)
  crds/         # CRDs
Hooks
- pre/post-install, upgrade, delete, rollback
- Useful for: DB migration, config validation
- Annotation: helm.sh/hook: post-install
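A hook sketch: a Job annotated so Helm runs it before install (names and image are placeholders):
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: myrepo/migrate:latest
          command: ["./migrate.sh"]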
Microservices Deployment on K8s
Design Patterns
- Sidecar: logging, proxy alongside app (e.g. Istio envoy)
- Ambassador: proxy as sidecar for external access
- Adapter: normalize monitoring output
- Circuit Breaker: fail-fast, retry w/ backoff
Service Mesh (Istio)
- Envoy sidecar proxies intercept traffic
- mTLS between services (mutual TLS)
- Traffic splitting (canary, blue-green)
- Observability: traces, metrics, logs
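A canary traffic-split sketch with an Istio VirtualService; the host and subset names are placeholders and assume a matching DestinationRule:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app
  http:
    - route:
        - destination:
            host: my-app
            subset: stable
          weight: 90
        - destination:
            host: my-app
            subset: canary
          weight: 10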
Deployment Strategies
- Rolling: gradual replacement (default K8s)
- Blue-Green: two full environments, switch DNS/LB
- Canary: small % traffic to new version, monitor, ramp
- Feature flags: toggle on/off without deploy
CI/CD Pipeline: Azure DevOps & GitHub Actions
Azure DevOps Pipeline
- azure-pipelines.yml in repo root
- Stages: Build → Test → Deploy
- Environments: dev → staging → prod (approval gates)
- Variables + Variable Groups + Key Vault integration
- Service Connection → GCP (Workload Identity Federation)
GitHub Actions
- .github/workflows/*.yml
- Events: push, PR, schedule, workflow_dispatch
- OIDC to GCP (no static creds)
- Matrix builds, caches, artifacts
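An OIDC auth sketch using google-github-actions/auth; the provider path and service account are placeholders:
permissions:
  id-token: write
  contents: read
steps:
  - uses: actions/checkout@v4
  - uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: projects/123/locations/global/workloadIdentityPools/github/providers/my-repo
      service_account: deployer@PROJECT_ID.iam.gserviceaccount.com
  - run: gcloud storage ls   # authenticated, no static keys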
Pipeline Flow (Full)
[PR] → Lint + Unit Test + SonarQube + Checkov (IaC)
↓ merge to main
[Push] → Build → Tag (semver) → Push to Artifact Registry
↓
[Terraform] → terraform plan (PR) → terraform apply (merge)
↓
[Deploy] → Helm upgrade --install → Smoke test
↓
[Prod] → Approval gate → Canary → 100% rollout
Key Practices
- Trunk-based: short-lived branches, main is deployable
- GitFlow: develop → release branches, for larger teams
- Semantic versioning: vMAJOR.MINOR.PATCH
- SBOM generation in pipeline (Trivy)
GitHub / Azure DevOps Integration
- GitHub repos ↔ Azure Boards (cross-link)
- GitHub Advanced Security (code scanning, secret scanning)
- Branch protection rules: require PR, status checks
Terraform: IaC for GCP
Core Workflow
- terraform init (provider, backend, modules)
- terraform plan (preview, output to file)
- terraform apply (execute plan)
- terraform destroy
State Management
- Backend: GCS bucket (remote state)
- Locking: built into the GCS backend
- terraform state list, terraform state mv
- Remote state data source: terraform_remote_state
GCP Provider
terraform {
  backend "gcs" {
    bucket = "my-tfstate-bucket"
    prefix = "prod/network"
  }
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}
Best Practices
- Modularize: modules/ directory
- Env segregation: envs/dev/, envs/prod/
- Use .tfvars files per env
- Pin provider versions
- terraform plan in CI PR checks
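A module usage sketch; the path and inputs are placeholders:
module "network" {
  source     = "./modules/network"
  project_id = var.project_id
  subnets = {
    "prod-us" = { cidr = "10.10.0.0/20", region = "us-central1" }
  }
}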
Import & Move
- terraform import google_container_cluster.my_cluster PROJECT/LOCATION/NAME
- terraform state rm to remove from state
- terraform state mv to restructure state
Ansible: Configuration Management
Core Concepts
- Playbook: YAML with plays (hosts + tasks)
- Inventory: hosts file or dynamic (gcp_compute)
- Roles: reusable structured playbooks
- Modules: idempotent units (e.g. copy, template, package)
GCP Dynamic Inventory
# requirements.yml
collections:
  - google.cloud

# inventory.gcp.yml
plugin: gcp_compute
projects:
  - my-project
filters:
  - "labels.env=prod"
Common Pattern
---
- name: Configure web servers
  hosts: webservers
  become: yes
  vars:
    app_version: "{{ lookup('env', 'APP_VERSION') }}"
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present
    - name: Deploy app config
      template:
        src: app.conf.j2
        dest: /etc/nginx/sites-enabled/app
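Run it against the dynamic inventory above (site.yml is a placeholder playbook name; --check --diff previews changes):
ansible-playbook -i inventory.gcp.yml site.yml --check --diff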
SonarQube & Checkov: Code Quality & IaC Security
SonarQube
- Static code analysis (SAST)
- Quality Gates: new code coverage, bugs, vulnerabilities
- Measures: reliability, security, maintainability, coverage, duplication
- CI integration: sonar-scanner; GitHub Action: SonarSource/sonarcloud-github-action
- sonar-project.properties in repo root
Checkov (IaC)
- Scans Terraform, CloudFormation, K8s, Helm, Dockerfile
- Checks misconfigurations (public buckets, open ports, etc.)
- CI: checkov --directory . --framework terraform
- .checkov.yml for skip/severity config
- Bridgecrew cloud platform for GUI/policies
CI Integration Example
# GitHub Actions steps
- name: IaC Security Scan
  run: |
    checkov --directory terraform/ \
      --framework terraform \
      --soft-fail \
      --output junitxml > checkov-report.xml
- name: SonarCloud Scan
  uses: SonarSource/sonarcloud-github-action@master
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
Vertex AI & MLOps
Vertex AI Components
- Vertex AI Pipelines: Kubeflow-based DAGs
- Model Registry: versioning, staging/production
- Feature Store: managed feature serving
- Endpoint: model serving with autoscaling
- Model Evaluation: classification, regression metrics
MLOps Pipeline
- Data validation (TFX) → Feature engineering → Train → Evaluate
- If metrics pass → Deploy to staging → Shadow traffic → Prod
- Monitoring: prediction drift, feature drift
- Retrain trigger: scheduled or drift threshold
MLOps Lifecycle
Raw Data → Data Validation → Feature Store
↓
Train (Vertex AI Training) → AutoML / Custom Container
↓
Evaluate (thresholds: accuracy, precision, recall)
↓
Model Registry (versioned)
↓
Deploy to Endpoint (canary → 100%)
↓
Monitor (Vertex AI Model Monitoring)
Key Concepts
- AutoML: train without code (image, tabular, text)
- Custom Training: bring your own container
- Hyperparameter Tuning: Vizier-based
- Explainable AI: feature attributions
CI/CD for ML
- Pipeline = code (Python SDK)
- Compile → Upload → Run (trigger on data change)
- Model validation in CI gate
- Deploy via CI: gcloud ai endpoints deploy-model
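A deploy sketch; ENDPOINT_ID and MODEL_ID are placeholders, and --traffic-split=0=100 sends all traffic to the new deployment:
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=my-model-v2 \
  --traffic-split=0=100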
Gemini & Google AI Studio
Gemini Models
- Gemini 1.5 Flash: fast, cost-effective
- Gemini 1.5 Pro: complex reasoning, long context (1M tokens)
- Gemini 2.0 Flash: latest (multimodal, tool use)
- Available via: AI Studio (dev) / Vertex AI (prod)
Google AI Studio
- Free tier, rapid prototyping
- System instructions, safety settings
- Export to Vertex AI for production
- API key based (no IAM)
Vertex AI Gemini API
- IAM-based auth (Service Account)
- Supports grounding (search, data store)
- Model Garden: curated + open models
- Safety filters, content moderation
Agents & Extensions
- Vertex AI Agent Builder: no-code / low-code agents
- Grounding in Google Search or enterprise data
- Function calling / tool use
- Conversational agents (Dialogflow integration)
API Comparison
# AI Studio (dev / prototyping)
curl -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents": [{"parts": [{"text": "Hello"}]}]}'

# Vertex AI (production)
curl -X POST "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT/locations/us-central1/publishers/google/models/gemini-2.0-flash:generateContent" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"contents": [{"parts": [{"text": "Hello"}]}]}'
Grafana: Observability & Dashboards
Data Sources
- Prometheus (metrics from K8s/GKE)
- Cloud Monitoring (GCP metrics)
- Loki (logs)
- Tempo (traces)
- Cloud Logging (via Grafana Cloud or self-managed)
Key Features
- Dashboard as Code (JSON / Terraform)
- Alerting: rules, notification channels (PagerDuty, Slack)
- Annotations: mark deployments on graphs
- Explore: ad-hoc PromQL, LogQL queries
Common Dashboard Panels
- K8s Cluster: CPU/Memory, pod status, node health
- GKE Workload: request rate, latency (p50/p95/p99), error rate
- Terraform: number of resources, last plan time
- CI/CD: pipeline success rate, duration trend
- Vertex AI: prediction latency, error budget, model drift
PromQL Quick Reference
rate(http_requests_total[5m])                                               # request rate
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))   # p95 latency
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)            # CPU by namespace
avg(go_memstats_alloc_bytes) by (job)                                      # memory by job
API Gateway: GCP API Management
GCP Options
- Cloud Endpoints: Extensible Service Proxy (ESP), OpenAPI/gRPC
- Apigee: full lifecycle API management, monetization, analytics
- Cloud Load Balancer: L7 routing, URL maps, SSL
- Kong / Ambassador: open-source on K8s
When to Use What
- Simple REST/gRPC: Cloud Endpoints + Cloud LB
- Enterprise API management: Apigee
- K8s-native: Kong Ingress or Ambassador
- Serverless: API Gateway (Cloud Functions, Cloud Run)
Cloud Endpoints Example
# openapi.yaml
swagger: "2.0"
info:
  title: "My API"
  version: "1.0.0"
x-google-endpoints:
  - name: "my-api.endpoints.PROJECT.cloud.goog"
x-google-backend:
  address: "https://us-central1-PROJECT.cloudfunctions.net/my-function"

# Deploy:
gcloud endpoints services deploy openapi.yaml
Linux Essentials: Troubleshooting & Admin
Process Management
ps aux | grep process                # find process
top / htop                           # live resource monitor
systemctl status/start/stop/enable   # manage services
journalctl -u service-name -f        # follow logs
Disk & Memory
df -h      # disk usage
du -sh *   # directory sizes
free -h    # memory
lsblk      # block devices
Network Troubleshooting
ping / traceroute            # reachability / path
ss -tulpn                    # listening ports
curl -v http://...           # HTTP debug
nslookup / dig               # DNS
tcpdump -i eth0 port 80      # packet capture
File Operations
grep -r "pattern" .— recursive searchfind / -name "file" 2>/dev/nullrsync -avz source/ dest/chmod/chown— permissionstar -czf archive.tar.gz dir
Diagnostic Sequence
# When something breaks:
1. Check logs: journalctl / tail -f /var/log/syslog
2. Check resources: top, df -h, free -h
3. Check network: ss -tulpn, curl, ping
4. Check config: diff expected vs actual, syntax check
5. Restart service: systemctl restart && systemctl status