Mesa's vc4 graphics driver supports multiple implementations of Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0 through Raspberry Pi 3 ...
VC4

Mesa’s vc4 graphics driver supports multiple implementations of
Broadcom’s VideoCore IV GPU. It is notably used in the Raspberry Pi 0
through Raspberry Pi 3 hardware, and the driver is included as an
option as of the 2016-02-09 Raspbian release using raspi-config.
On most other distributions such as Debian or Fedora, you need no
configuration to enable the driver.

This Mesa driver talks directly to the vc4 kernel DRM
driver for scheduling graphics commands, and that module also provides
KMS display support. The driver makes no use of the closed source VPU
firmware on the VideoCore IV block, instead talking directly to the
GPU block from Linux.

GLES2 support

The vc4 driver is a nearly conformant GLES2 driver, and the hardware
has achieved GLES2 conformance with other driver stacks.

OpenGL support

Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
mostly correct but with a few caveats.

4-byte index buffers.

GLES 2.0, and vc4, don’t have GL_UNSIGNED_INT index buffers. To support
them in vc4, we create a shadow copy of your index buffer with the
indices truncated to 2 bytes. This is incorrect (and will assertion
fail in debug builds of Mesa) if any of the indices were > 65535. To
fix that, we would need to detect this case and rewrite the index
buffer and vertex buffers to do a series of draws each with small
indices and new vertex attrib bindings.

To avoid this problem, ensure that all index buffers are written using
GL_UNSIGNED_SHORT, even at the cost of doing multiple draw calls
with updated vertex attrib bindings.
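
As an application-side illustration of that advice, the following is a minimal sketch of splitting one large triangle list into several draws whose indices each fit in GL_UNSIGNED_SHORT. It is not code from Mesa: the function name split_indexed_triangles and the greedy batching strategy are assumptions made for this example, and it only handles plain triangle lists. Each batch comes with the list of original vertices to copy into that batch's own vertex buffer, which corresponds to the "updated vertex attrib bindings" mentioned above.

```python
from typing import List, Tuple

MAX_U16_VERTICES = 65536  # GL_UNSIGNED_SHORT can address indices 0..65535


def split_indexed_triangles(
    indices: List[int], max_vertices: int = MAX_U16_VERTICES
) -> List[Tuple[List[int], List[int]]]:
    """Split a triangle list into batches that only need small indices.

    Returns (vertex_ids, local_indices) pairs. vertex_ids lists the original
    vertex numbers to copy into the batch's vertex buffer, in order, and
    local_indices refers to positions within vertex_ids.
    """
    if len(indices) % 3 != 0:
        raise ValueError("triangle list length must be a multiple of 3")
    if max_vertices < 3:
        raise ValueError("a batch must be able to hold at least one triangle")

    batches: List[Tuple[List[int], List[int]]] = []
    remap = {}                      # original vertex id -> local index
    vertex_ids: List[int] = []
    local: List[int] = []

    for i in range(0, len(indices), 3):
        tri = indices[i:i + 3]
        new_vertices = {v for v in tri if v not in remap}
        # Start a new batch if this triangle would overflow the current one.
        if len(remap) + len(new_vertices) > max_vertices:
            batches.append((vertex_ids, local))
            remap, vertex_ids, local = {}, [], []
        for v in tri:
            if v not in remap:
                remap[v] = len(vertex_ids)
                vertex_ids.append(v)
            local.append(remap[v])

    if local:
        batches.append((vertex_ids, local))
    return batches


if __name__ == "__main__":
    # Two triangles that reference a vertex beyond the 16-bit range.
    demo = [0, 1, 70000, 70000, 1, 2]
    for ids, small in split_indexed_triangles(demo):
        print(len(ids), "vertices,", len(small) // 3, "triangles")
```

The greedy split keeps triangles in their original order, so draws issued batch by batch rasterize in the same sequence as the original draw call. Vertices shared across a batch boundary are duplicated, which is the cost of keeping every index small.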

Occlusion queries

The VC4 hardware has no support for occlusion queries. GL 2.0
requires that you support the occlusion queries extension, but you can
report 0 from glGetQueryiv(GL_SAMPLES_PASSED,
GL_QUERY_COUNTER_BITS). This is absurd, but it’s how OpenGL handles
“we want the functions to be present everywhere, but we want it to be
optional for hardware to support it.” Sadly, gallium doesn’t yet allow
the driver to report 0 query bits.

Primitive mode

VC4 doesn’t support reducing triangles/quads/polygons to lines and
points like desktop GL. If front/back mode matched, we could rewrite
the index buffer to the new primitive type, but we don’t. If
front/back mode don’t match, we would need to run the vertex shader in
software, classify the prims, write new index buffers, and emit
(possibly many) new draw calls to rasterize the new prims in the same
order.

Bug Reporting

VC4 rendering bugs should go to Mesa’s GitLab issues page.

By far the easiest way to communicate bug reports for rendering
problems is to take an apitrace. This passes exactly the drawing you
saw to the developer, without the developer needing to download and
build the application and replicate whatever steps you took to produce
the problem. Traces attached to bug reports should ideally be small.

For GPU hangs, if you can get a short apitrace that produces the
problem, that’s still the best. If the problem takes a long time to
reproduce or you can’t capture it in a trace, describing how to
reproduce and including a GPU hang dump would be the most
useful. Install vc4-gpu-tools and use
vc4_dump_hang_state my-app.hang. Sometimes the hang file will
provide useful information.

Tiled Rendering

VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
32x32 (MSAA) tiles and rendering the scene per tile. Rasterization
looks like:

1. (CPU) Allocate space to store a list of draw commands per tile
2. (CPU) Set up a command list per tile that does:
   - Either load the current tile's color buffer from memory, or clear it.
   - Either load the current tile's depth buffer from memory, or clear it.
   - Branch into the draw list for the tile
   - Store the depth buffer if anybody might read it.
   - Store the color buffer if anybody might read it.
3. (GPU) Initialize the per-tile draw call lists to empty.
4. (GPU) Run all draw calls collecting vertex data
5. (GPU) For each tile covered by a draw call's primitive:
   - Emit state packets to the list to update it to the current draw call's state.
   - Emit a primitive description into the tile's draw call list.

Tiled rendering avoids the need for large render target caches, at the
expense of increasing the cost of vertex processing. Unlike some tiled
renderers, VC4 has no non-tiled rendering mode.
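
The binning steps above can be modelled in a few lines. This is a simplified sketch rather than a description of what the hardware actually does: it assumes each primitive is represented by its pixel bounding box (real binning can test coverage more precisely), and the names tiles_covered and bin_draws are invented for this example. The tile sizes are the 64x64 and 32x32 values given above.

```python
from typing import Dict, List, Tuple

Tile = Tuple[int, int]
BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, x1/y1 exclusive


def tiles_covered(bbox: BBox, tile_size: int, width: int, height: int) -> List[Tile]:
    """Return the (tx, ty) tiles touched by a bounding box, clamped to the target."""
    x0, y0 = max(bbox[0], 0), max(bbox[1], 0)
    x1, y1 = min(bbox[2], width), min(bbox[3], height)
    if x0 >= x1 or y0 >= y1:
        return []  # empty or entirely outside the render target
    return [
        (tx, ty)
        for ty in range(y0 // tile_size, (y1 - 1) // tile_size + 1)
        for tx in range(x0 // tile_size, (x1 - 1) // tile_size + 1)
    ]


def bin_draws(
    draws: List[List[BBox]], width: int, height: int, msaa: bool = False
) -> Dict[Tile, list]:
    """Build per-tile draw lists from draws, each a list of primitive boxes."""
    tile_size = 32 if msaa else 64
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size

    # Initialize the per-tile draw call lists to empty.
    bins: Dict[Tile, list] = {
        (tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)
    }
    last_state: Dict[Tile, int] = {}  # draw whose state each tile last received

    for draw_id, prims in enumerate(draws):
        for prim_id, bbox in enumerate(prims):
            for tile in tiles_covered(bbox, tile_size, width, height):
                # Emit state only when the tile has not yet seen this draw.
                if last_state.get(tile) != draw_id:
                    bins[tile].append(("state", draw_id))
                    last_state[tile] = draw_id
                bins[tile].append(("prim", draw_id, prim_id))
    return bins
```

Tiles that no primitive touches keep an empty list, so the per-tile command list for them reduces to the load or clear followed by the store.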

Performance Tricks

Reducing memory bandwidth by clearing.

Even if your drawing is going to cover the entire render target, it’s
more efficient for VC4 if you emit a glClear() of the color and
depth buffers. This means we can skip the load of the previous state
from memory, in favor of a cheap GPU-side memset() of the tile
buffer before we start running the draw calls.

Reducing memory bandwidth with scissoring.

If all draw calls for the frame are done with a glScissor() to only
part of the screen, then we can skip setting up the tiles outside that
area, which means a little less memory used setting up the empty bins,
and a lot less memory bandwidth used loading/storing the unchanged tiles.
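
To get a feel for the scale of the scissoring saving, here is a back-of-the-envelope sketch. The numbers it produces are illustrative only: it assumes 4 bytes per pixel for both the color and the depth buffer, and a worst case where every touched tile is both loaded and stored. The helper names are invented for this example.

```python
def tiles_for_rect(x: int, y: int, width: int, height: int, tile_size: int = 64) -> int:
    """Count the tiles touched by a pixel rectangle such as a scissor box."""
    if width <= 0 or height <= 0:
        return 0
    tiles_x = (x + width - 1) // tile_size - x // tile_size + 1
    tiles_y = (y + height - 1) // tile_size - y // tile_size + 1
    return tiles_x * tiles_y


def load_store_bytes(
    tiles: int, tile_size: int = 64, bytes_per_pixel: int = 4, buffers: int = 2
) -> int:
    """Worst-case traffic: each buffer of each tile is loaded once and stored once.

    bytes_per_pixel and buffers (color plus depth) are assumptions for this
    estimate, not values queried from the driver.
    """
    per_tile = tile_size * tile_size * bytes_per_pixel
    return tiles * per_tile * buffers * 2


if __name__ == "__main__":
    full = tiles_for_rect(0, 0, 1920, 1080)
    scissored = tiles_for_rect(0, 0, 640, 360)
    print("full frame:", full, "tiles,", load_store_bytes(full), "bytes")
    print("scissored: ", scissored, "tiles,", load_store_bytes(scissored), "bytes")
```

Because tiles are the unit of load and store, aligning a scissor rectangle to multiples of the tile size avoids paying for partially covered tiles along its edges.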

Reducing memory bandwidth with glInvalidateFramebuffer().

If we don’t know who might use the contents of the framebuffer’s depth
or color in the future, then we have to store it for later. If you use
glInvalidateFramebuffer() before accessing the results of your
rendering, then we can skip the store of the depth or color
buffer. Note that this is unimplemented.

Avoid non-constant GLSL array indexing

In VC4 the only non-constant-index array access supported in hardware
is uniforms. For everything else (inputs, outputs, temporaries), we
have to lower them to an IF ladder like:

    if (index == 0)
       return array[0]
    else if (index == 1)
       return array[1]
    ...

This is very expensive as we probably have to execute every branch of
every IF statement due to it being a SIMD machine. So, it is
recommended (if you can) to avoid non-uniform non-constant array
indexing.

Note that if you do variable indexing within a bounded loop that Mesa
can unroll, that can actually count as constant indexing.
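
A toy model helps show why the ladder is costly on a SIMD machine. The sketch below is not how the vc4 compiler or hardware works internally. It assumes a simple masked-execution model in which every rung of the ladder is stepped through by all lanes, and only lanes whose index matches keep the value. The function name and the lane count are invented for this example.

```python
from typing import List, Sequence, Tuple


def if_ladder_simd(
    arrays: Sequence[Sequence[float]], indices: Sequence[int]
) -> Tuple[List[float], int]:
    """Emulate array[index] across SIMD lanes using an IF ladder.

    arrays[lane] is that lane's private copy of the array (a temporary),
    indices[lane] is the per-lane index. Returns the per-lane results and
    the number of ladder rungs that were executed.
    """
    lanes = len(indices)
    length = len(arrays[0])
    result = [0.0] * lanes
    rungs_executed = 0

    for k in range(length):
        # Every lane evaluates the comparison for this rung...
        mask = [idx == k for idx in indices]
        rungs_executed += 1
        # ...but only the lanes whose index matched keep the value.
        for lane in range(lanes):
            if mask[lane]:
                result[lane] = arrays[lane][k]
    return result, rungs_executed


if __name__ == "__main__":
    per_lane = [[10.0, 11.0, 12.0, 13.0]] * 4
    values, cost = if_ladder_simd(per_lane, [3, 0, 2, 2])
    print(values, "after", cost, "rungs")
```

In this model the cost grows with the length of the array rather than with the single element each lane needs, which is why a constant index (or a loop the compiler can unroll into constant indices) is so much cheaper.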

Increasing GPU memory

Increase CMA pool size

The memory for the VC4 driver is allocated from the standard Linux CMA
pool. The size of this pool defaults to 64 MB. To increase this, pass
an additional parameter on the kernel command line. Edit the boot
partition’s cmdline.txt to add:

    cma=256M@256M

cmdline.txt is a single line with whitespace separated parameters.

The first value is the size of the pool and the second parameter is
the start address of the pool. The pool size can be increased further,
but it must fit into the memory, so size + start address must be below
1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). Also this
reduces the memory available to Linux.
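
The size and start address arithmetic is easy to get wrong, so a small checker can help before rebooting. This is a hedged sketch, not an official tool: it only understands the cma=SIZE@START form shown above with optional K, M or G suffixes, it treats a pool that ends exactly at the limit as fitting (an assumption, since the text only says "below"), and the function names are invented for this example.

```python
import re
from typing import Tuple

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Convert a value such as '256M' to bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def parse_cma(cmdline: str) -> Tuple[int, int]:
    """Return (size, start) in bytes from a cma=SIZE@START parameter."""
    match = re.search(r"(?:^|\s)cma=([^\s@]+)@(\S+)", cmdline)
    if not match:
        raise ValueError("no cma=SIZE@START parameter found")
    return parse_size(match.group(1)), parse_size(match.group(2))


def cma_fits(cmdline: str, ram_bytes: int) -> bool:
    """Check that the pool ends within ram_bytes (boundary counted as fitting)."""
    size, start = parse_cma(cmdline)
    return size + start <= ram_bytes


if __name__ == "__main__":
    line = "console=tty1 cma=256M@256M rootwait"
    print("fits in 1024M:", cma_fits(line, parse_size("1024M")))
```

Pass the contents of cmdline.txt as the first argument and the limit for your board (1024M or 512M, as listed above) as the second.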

Decrease firmware memory

The firmware allocates a fixed chunk of memory before booting
Linux. If firmware functions are not required, this amount can be
reduced.

In config.txt, set gpu_mem to 16 if you do not need video decoding,
or to 64 if you do.

Performance debugging

Step 1: Known issues

The first tool to look at is running your application with the
environment variable VC4_DEBUG=perf set. This will report debug
information for many known causes of performance problems on the
console. Not all of them will cause visible performance improvements
when fixed, but it’s a good first step to see what might be going wrong.

Step 2: CPU vs GPU

The primary question is figuring out whether the CPU is busy in your
application, the CPU is busy in the GL driver, the GPU is waiting for
the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
point where the CPU is waiting for the GPU infrequently but for a
significant amount of time (however long it takes the GPU to draw a
frame).

Start with top while your application is running. Is the CPU usage
around 90%+? If so, then our performance analysis will be with
sysprof. If it’s not very high, is the GPU staying busy? We don’t have
a clean tool for this yet, but cat /debug/dri/0/v3d_regs could be
useful. If CT0CA != CT0EA or CT1CA != CT1EA, that
means that the GPU is currently busy processing some rendering job.
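
The register comparison can be scripted. The sketch below is an illustration only: the exact text layout of v3d_regs is an assumption here (it may differ between kernel versions), so the parser simply looks for a register name, optionally prefixed with V3D_, and takes the last hexadecimal value on the same line. The function names are invented for this example.

```python
import re
from typing import Dict

_NAME = re.compile(r"\b(?:V3D_)?(CT[01](?:CA|EA))\b")
_HEX = re.compile(r"0x[0-9a-fA-F]+")


def parse_regs(text: str) -> Dict[str, int]:
    """Extract CT0CA, CT0EA, CT1CA and CT1EA from a register dump."""
    regs: Dict[str, int] = {}
    for line in text.splitlines():
        name = _NAME.search(line)
        values = _HEX.findall(line)
        if name and values:
            # Assume the register value is the last hex number on the line.
            regs[name.group(1)] = int(values[-1], 16)
    return regs


def gpu_busy(regs: Dict[str, int]) -> bool:
    """Apply the rule above: busy if a current address differs from its end address."""
    return regs["CT0CA"] != regs["CT0EA"] or regs["CT1CA"] != regs["CT1EA"]


if __name__ == "__main__":
    # Hypothetical dump used only to exercise the parser.
    sample = (
        "V3D_CT0CA = 0x00001000\n"
        "V3D_CT0EA = 0x00001000\n"
        "V3D_CT1CA = 0x00002000\n"
        "V3D_CT1EA = 0x00002040\n"
    )
    print("GPU busy:", gpu_busy(parse_regs(sample)))
```

To use it on a live system, read the v3d_regs file mentioned above (on many systems debugfs is mounted somewhere else, such as under /sys/kernel/debug) and pass its contents to parse_regs(). A single sample only says whether the GPU is busy at that instant, so sampling repeatedly gives a better picture.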

sysprof for CPU usage

If the CPU is totally busy and the GPU isn’t terribly busy, there is
an excellent tool for debugging: sysprof. Install, run as root (so you
can get system-wide profiling), hit play and later stop. The top-left
area shows the flat profile sorted by total time of that symbol plus
its descendants. The top few are generally uninteresting (main() and
its descendants consuming a lot), but eventually you can get down to
something interesting. Click it, and to the right you get the
callchains to descendants – where all that time actually went. On the
other hand, the lower left shows callers – double-clicking those
selects that as the symbol to view, instead.

Note that you need debug symbols for the callgraphs in sysprof to
work, which is where most of its value is. Most distributions offer
debug symbol packages from their builds which can be installed
separately, and sysprof will find them. I’ve found that on ARM, the
debug packages are not enough, and if someone could determine what is
necessary for callgraphs in debugging, that would be really helpful.

perf for CPU waits on GPU

If the CPU is not very busy and the GPU is not very busy, then we’re
probably ping-ponging between the two. Most cases of this would be
noticed by VC4_DEBUG=perf, but not all. To see all cases where
this happens, use the perf tool from the Linux kernel (note: unrelated
to VC4_DEBUG=perf):

    sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

If you want to see the whole system’s stalls for a period of time
(very useful!), use the -a flag instead of a particular command
name. Just ^C when you’re done capturing data.

At exit, you’ll have perf.data in the current directory. You can print
out the results with:

    perf report | less

Debugging for GPU fully busy

As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware’s
performance counters in OpenGL. Install apitrace, and trace your
application with:

    apitrace trace <appname>          # for GLX applications
    apitrace trace -a egl <appname>   # for EGL applications

Once you’ve captured a trace, you can see what counters are available
and replay it while looking at some of those counters:

    apitrace replay <appname>.trace --list-metrics
    apitrace replay <appname>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

Multiple counters can be captured at once with commas separating them.

Once you’ve found what draw calls are surprisingly expensive in one of
the counters, you can work out which ones they were at the GL level by
opening the trace up in qapitrace and using ^-G to jump to that call
number and ^-L to look up the GL state at that call.

shader-db

shader-db is often used as a proxy for real-world app performance when
working on the compiler in Mesa. On vc4, there is a lot of
state-dependent code in the shaders (like blending or vertex attribute
format handling), so the typical shader-db will miss important
areas for optimization. Instead, anholt wrote a new one based on
apitraces. Once you have a collection of traces, starting from
traces-db,
you can test a compiler change in this shader-db with:

    ./run.py > before
    (cd ../mesa && make install)
    ./run.py > after
    ./report.py before after

Hardware Documentation

For driver developers, Broadcom publicly released a specification PDF for the 21553, which
is closely related to the vc4 GPU present in the Raspberry Pi. They
also released a snapshot
of a corresponding Android graphics driver. That graphics driver was
ported to Raspbian for a demo, but was not expected to have ongoing
development.

Developers with NDA access with Broadcom or Raspberry Pi can
potentially get access to “simpenrose”, the C software simulator of
the GPU. The Mesa driver includes a backend (vc4_simulator.c) to
use simpenrose from an x86 system with the i915 graphics driver with
all of the vc4 rendering commands emulated on simpenrose and memcpyed
to the real GPU.