Compute In Memory Based Heterogeneous AI Silicon

Latency
Optimized Silicon

A 3D Hybrid Cube processing Unit (HCU) featuring 64 TOPS of Compute In Memory on a planar process node for tensors, plus a custom RISC-V engine for vectors. Sub-millisecond reasoning powered by 6 GB–48 GB of 3D DRAM.

Custom CiM Tensor Core
Custom 3D Memory Controller
Custom ISA-Extended RISC-V
Custom All-Reduce Enhanced NoC
Cimicro AI Silicon

Compute In Memory
Custom Tensor Core

Our Compute In Memory core is built from hand-crafted custom circuits. By bypassing standard cell libraries, our MAC units use an optimized adder-tree structure to achieve extreme efficiency for FP4 and FP8 tensor operations.
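As an illustration, the pairwise reduction an adder tree performs can be sketched in a few lines of Python. This is a software model only; the real datapath is custom circuitry, and the function names here are hypothetical:

```python
def adder_tree_sum(values):
    """Reduce partial products pairwise, like a balanced adder tree:
    log2(N) stages instead of a serial accumulation chain."""
    while len(values) > 1:
        if len(values) % 2:  # pad odd-length levels with the additive identity
            values = values + [0]
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

def mac_dot(weights, activations):
    # multiply stage (performed in-memory on the real hardware), then tree reduce
    return adder_tree_sum([w * a for w, a in zip(weights, activations)])

print(mac_dot([1, 2, 3, 4], [5, 6, 7, 8]))  # 5 + 12 + 21 + 32 = 70
```

The point of the tree shape is that each reduction level runs in parallel, so latency grows with log2(N) rather than N.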

The Power of Custom Circuits
Area Footprint: -65% vs. Standard Cells
Energy Efficiency: 3.5x Better
Proprietary Adder-Tree Circuitry
FP8 / FP4
Native Tensor Support
64 TOPS
Compute Performance
DRAM Layer 4
DRAM Layer 3
DRAM Layer 2
DRAM Layer 1
Logic Die (No Buffer Die)

Vertical
3D Integration

By stacking 6 GB of DRAM directly atop our hybrid logic die and eliminating the buffer die, we unlock 2.0 TB/s of bandwidth with near-zero vertical latency.
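As back-of-the-envelope arithmetic on the figures above (not a spec, and assuming 6 GB means 6 × 10⁹ bytes), a full sweep of the stacked DRAM at 2.0 TB/s takes single-digit milliseconds:

```python
# Rough arithmetic only, from the figures quoted above.
capacity_bytes = 6e9                 # stacked DRAM capacity (assumed 6e9 bytes)
bandwidth_bytes_per_s = 2.0e12       # 2.0 TB/s vertical bandwidth
t_ms = capacity_bytes / bandwidth_bytes_per_s * 1e3
print(f"Full-stack sweep: {t_ms:.1f} ms")  # 3.0 ms
```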

Custom Memory Controller
Low-Latency Direct Path
Adaptive Charge/Discharge
Multi-Master Optimization
High Bandwidth: up to 8.0 TB/s
Total Capacity: up to 48 GB
BF16
Precision Acceleration
FP32
Non-linear Operations
Custom Vector ISA
vsetvli  t0, a0, e32, m8, ta, ma   # set vector length for a0 32-bit elements, LMUL=8
vle32.v  v8, (a1)                  # load source vector from address in a1
vfadd.vv v8, v8, v16               # element-wise FP add: v8 = v8 + v16
vse32.v  v8, (a2)                  # store result vector to address in a2

Custom RISC-V
Vector Engine

Complementing our Compute In Memory engine, the specialized RISC-V Vector Engine utilizes custom instruction set extensions to handle non-linear layers and complex vector arithmetic in BF16 and FP32 formats.

  • Extended SIMD Instructions
  • Full BF16 / FP32 Pipeline
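For context, a typical non-linear layer such an engine would handle is GELU. The plain-Python FP32 reference below uses the common tanh approximation; it illustrates the math only and is not cimicro's implementation:

```python
import math

def gelu(x):
    """GELU activation, tanh approximation (FP32 reference).
    On the HCU, element-wise non-linearities like this run on the
    vector engine while the CiM core handles the tensor matmuls."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

print([round(gelu(v), 4) for v in (-1.0, 0.0, 1.0)])
```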

Scalable NoC Fabric

Hardware-native Broadcast and All-Reduce support. Our in-house NoC fabric synchronizes multiple clusters of Compute In Memory nodes at near-physical-limit speed.


Hardware All-Reduce

Dedicated hardware logic for prefix-sum and collective operations, slashing inter-cluster communication latency by 80%.
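A software model of the collective being accelerated: an all-reduce in which every node ends up holding the sum of all nodes' values. The sketch below uses log2(N) recursive-doubling exchanges, a common structure for such hardware; it is illustrative only, not the NoC's actual algorithm:

```python
def all_reduce_sum(node_values):
    """Emulate an all-reduce: after log2(N) partner-exchange rounds,
    every node holds the global sum. Assumes N is a power of two."""
    n = len(node_values)
    vals = list(node_values)
    step = 1
    while step < n:
        nxt = vals[:]
        for i in range(n):
            nxt[i] = vals[i] + vals[i ^ step]  # exchange with XOR partner
        vals, step = nxt, step * 2
    return vals

print(all_reduce_sum([1, 2, 3, 4]))  # every node holds 10
```

Done in software, each round costs a full message exchange; dedicated collective logic in the fabric is what removes that per-round software overhead.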

One-Shot Broadcast

Proprietary NoC fabric supports single-cycle operand broadcasting to all Compute In Memory clusters simultaneously.

Diagram: Node-SRAM pipeline data movement (N1 → SRAM → N2 → SRAM → N3)

Data-Flow
Driven Execution

Computation triggers automatically upon operand arrival. Whether it is a Compute In Memory tensor op or a RISC-V vector op, our architecture cuts energy waste by 40% by eliminating instruction-fetch wait states.

DATA_ARRIVED -> EXECUTE
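The firing rule above can be modeled in a few lines of Python. This is an illustrative sketch of dataflow-driven execution; the class and method names are hypothetical:

```python
class DataflowOp:
    """An operation that fires the moment all operands have arrived,
    with no instruction-fetch polling or wait states."""

    def __init__(self, name, n_operands, fn):
        self.name, self.fn = name, fn
        self.slots = [None] * n_operands

    def deliver(self, idx, value):
        """Operand arrival; executes automatically once all slots fill."""
        self.slots[idx] = value
        if all(s is not None for s in self.slots):
            result = self.fn(*self.slots)
            print(f"{self.name}: DATA_ARRIVED -> EXECUTE = {result}")
            return result  # fired
        return None        # still waiting for operands

mac = DataflowOp("mac", 2, lambda a, b: a * b)
mac.deliver(0, 3)  # one operand present: op waits
mac.deliver(1, 4)  # all operands present: op fires, result 12
```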
Careers

JOIN THE CORE

Join us to craft the next generation of custom circuit AI silicon.

01
Hardware Engineering

Chip Design Engineer

CiM & Custom

Responsibilities

  • Design RTL for high-throughput Compute In Memory clusters.
  • Optimize micro-architecture for FP4/FP8 dataflows.
  • Manage timing closure on advanced logic nodes.

Requirements

  • Expertise in Verilog/SystemVerilog and ASIC design flow.
  • Experience with high-speed digital logic and custom circuits.
  • Strong background in computer architecture.
02
Hardware Engineering

Chip Verification Engineer

UVM / NoC

Responsibilities

  • Develop UVM testbenches for complex Compute In Memory nodes.
  • Ensure 100% functional and code coverage for silicon tape-out.
  • Execute gate-level simulations and debug complex hardware bugs.

Requirements

  • Proficiency in SystemVerilog, UVM, and scripting (Python).
  • Experience in coverage-driven verification for high-performance SoCs.
  • Solid understanding of AI hardware dataflows and memory.
03
Hardware Engineering

Circuit Design Engineer

Custom / SRAM

Responsibilities

  • Design custom high-speed SRAM and register files for NPU cores.
  • Optimize signal-path integrity and power delivery for the No-Buffer-Die stack.
  • Collaborate on advanced node custom circuit layout and analysis.

Requirements

  • PhD or 5+ years in custom digital circuit design.
  • Deep expertise in SPICE simulation and mixed-signal flows.
  • Expertise in advanced process nodes.
04
Software & AI

Quantization Algorithm Engineer

FP4 / FP8

Responsibilities

  • Research FP4/FP8 quantization for multi-billion parameter LLMs.
  • Develop QAT (Quantization Aware Training) pipelines for custom silicon backends.
  • Minimize precision loss in hardware-native Compute In Memory inference.

Requirements

  • Mastery of PyTorch and low-precision numerical optimization.
  • Solid background in Transformer and LLM architectures.
  • PhD in AI research or 3+ years in deep learning optimization.
05
Software & AI

Inference Framework Engineer

MLIR / LLVM

Responsibilities

  • Optimize MLIR/LLVM-based backends for the cimicro Compute In Memory engine.
  • Implement ultra-fast SIMD/Vector kernels for custom RISC-V cores.
  • Lead software-hardware co-design for high-efficiency memory management.

Requirements

  • Expertise in C++ and compiler infrastructure (MLIR, LLVM).
  • Knowledge of distributed computing and collective operations (All-Reduce).
  • Prior experience with GPU/NPU programming models.

Send your resume to:

hr@cimicro.ai