VC4 — The Mesa 3D Graphics Library latest documentation


Mesa's vc4 graphics driver supports multiple implementations of Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0 through Raspberry Pi 3 hardware, and the driver is included as an option as of the 2016-02-09 Raspbian release using raspi-config. On most other distributions such as Debian or Fedora, you need no configuration to enable the driver.

This Mesa driver talks directly to the vc4 kernel DRM driver for scheduling graphics commands, and that module also provides KMS display support. The driver makes no use of the closed source VPU firmware on the VideoCore IV block, instead talking directly to the GPU block from Linux.

GLES2 support

The vc4 driver is a nearly conformant GLES2 driver, and the hardware has achieved GLES2 conformance with other driver stacks.

OpenGL support

Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is mostly correct but with a few caveats.

4-byte index buffers

GLES 2.0, and vc4, don't have GL_UNSIGNED_INT index buffers. To support them in vc4, we create a shadow copy of your index buffer with the indices truncated to 2 bytes. This is incorrect (and will assertion fail in debug builds of Mesa) if any of the indices were >65535. To fix that, we would need to detect this case and rewrite the index buffer and vertex buffers to do a series of draws, each with small indices and new vertex attrib bindings.

To avoid this problem, ensure that all index buffers are written using GL_UNSIGNED_SHORT, even at the cost of doing multiple draw calls with updated vertex attrib bindings.

Occlusion queries

The VC4 hardware has no support for occlusion queries. GL 2.0 requires that you support the occlusion queries extension, but you can report 0 from glGetQueryiv(GL_SAMPLES_PASSED, GL_QUERY_COUNTER_BITS). This is absurd, but it's how OpenGL handles "we want the functions to be present everywhere, but we want it to be optional for hardware to support it." Sadly, gallium doesn't yet allow the driver to report 0 query bits.

Primitive mode

VC4 doesn't support reducing triangles/quads/polygons to lines and points like desktop GL. If the front and back modes matched, we could rewrite the index buffer to the new primitive type, but we don't. If the front and back modes don't match, we would need to run the vertex shader in software, classify the prims, write new index buffers, and emit (possibly many) new draw calls to rasterize the new prims in the same order.

Bug Reporting

VC4 rendering bugs should go to Mesa's gitlab issues page.

By far the easiest way to communicate bug reports for rendering problems is to take an apitrace. This passes exactly the drawing you saw to the developer, without the developer needing to download and build the application and replicate whatever steps you took to produce the problem. Traces attached to bug reports should ideally be small.

For GPU hangs, if you can get a short apitrace that produces the problem, that's still the best. If the problem takes a long time to reproduce or you can't capture it in a trace, describing how to reproduce and including a GPU hang dump would be the most useful. Install vc4-gpu-tools and use vc4_dump_hang_state my-app.hang. Sometimes the hang file will provide useful information.

Tiled Rendering

VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or 32x32 (MSAA) tiles and rendering the scene per tile. Rasterization looks like:

    (CPU) Allocate space to store a list of draw commands per tile
    (CPU) Set up a command list per tile that does:
        Either load the current tile's color buffer from memory, or clear it.
        Either load the current tile's depth buffer from memory, or clear it.
        Branch into the draw list for the tile
        Store the depth buffer if anybody might read it.
        Store the color buffer if anybody might read it.
    (GPU) Initialize the per-tile draw call lists to empty.
    (GPU) Run all draw calls collecting vertex data
    (GPU) For each tile covered by a draw call's primitive:
        Emit state packets to the list to update it to the current draw call's state.
        Emit a primitive description into the tile's draw call list.

Tiled rendering avoids the need for large render target caches, at the expense of increasing the cost of vertex processing. Unlike some tiled renderers, VC4 has no non-tiled rendering mode.

Performance Tricks

Reducing memory bandwidth by clearing.

Even if your drawing is going to cover the entire render target, it's more efficient for VC4 if you emit a glClear() of the color and depth buffers. This means we can skip the load of the previous state from memory, in favor of a cheap GPU-side memset() of the tile buffer before we start running the draw calls.

Reducing memory bandwidth with scissoring.

If all draw calls for the frame are with a glScissor() to only part of the screen, then we can skip setting up the tiles for that area, which means a little less memory used setting up the empty bins, and a lot less memory used loading/storing the unchanged tiles.

Reducing memory bandwidth with glInvalidateFramebuffer().
If we don't know who might use the contents of the framebuffer's depth or color in the future, then we have to store it for later. If you use glInvalidateFramebuffer() before accessing the results of your rendering, then we can skip the store of the depth or color buffer. Note that this is unimplemented.

Avoid non-constant GLSL array indexing

In VC4 the only non-constant-index array access supported in hardware is uniforms. For everything else (inputs, outputs, temporaries), we have to lower them to an IF ladder like:

    if (index == 0)
       return array[0]
    else if (index == 1)
       return array[1]
    ...

This is very expensive as we probably have to execute every branch of every IF statement due to it being a SIMD machine. So, it is recommended (if you can) to avoid non-uniform non-constant array indexing.

Note that if you do variable indexing within a bounded loop that Mesa can unroll, that can actually count as constant indexing.

Increasing GPU memory

Increase CMA pool size

The memory for the VC4 driver is allocated from the standard Linux CMA pool. The size of this pool defaults to 64 MB. To increase this, pass an additional parameter on the kernel command line. Edit the boot partition's cmdline.txt to add:

    cma=256M@256M

cmdline.txt is a single line with whitespace-separated parameters.

The first value is the size of the pool and the second is the start address of the pool. The pool size can be increased further, but it must fit into the memory, so size + start address must be below 1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). This also reduces the memory available to Linux.

Decrease firmware memory

The firmware allocates a fixed chunk of memory before booting Linux. If firmware functions are not required, this amount can be reduced.

In config.txt, set gpu_mem to 16 if you do not need video decoding, or to 64 if you do.

Performance debugging

Step 1: Known issues

The first tool to look at is running your application with the environment variable VC4_DEBUG=perf set. This will report debug information for many known causes of performance problems on the console. Not all of them will cause visible performance improvements when fixed, but it's a good first step to see what might be going wrong.

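As a minimal sketch of this first step, the Python helper below launches a program with VC4_DEBUG=perf in its environment and collects everything it prints. The function names and the stand-in command are hypothetical and only for illustration. Note that the helper overwrites any VC4_DEBUG value that is already set.

```python
import os
import subprocess
import sys


def vc4_perf_env(base_env=None):
    """Return a copy of the environment with VC4_DEBUG=perf set.

    Any existing VC4_DEBUG value is overwritten.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["VC4_DEBUG"] = "perf"
    return env


def run_with_perf_debug(argv):
    """Run argv with VC4_DEBUG=perf and return everything it printed.

    stdout and stderr are merged so the driver's messages are captured
    regardless of which stream they are written to.
    """
    result = subprocess.run(
        argv,
        env=vc4_perf_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in command (hypothetical): a child Python process that echoes
    # the variable. Replace it with the GL application being investigated.
    demo = [sys.executable, "-c", "import os; print(os.environ['VC4_DEBUG'])"]
    print(run_with_perf_debug(demo), end="")
```

Saving the returned text to a file makes it easier to compare the reported issues before and after a change to the application.
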
Step 2: CPU vs GPU

The primary question is figuring out whether the CPU is busy in your application, the CPU is busy in the GL driver, the GPU is waiting for the CPU, or the CPU is waiting for the GPU. Ideally, you get to the point where the CPU is waiting for the GPU infrequently but for a significant amount of time (however long it takes the GPU to draw a frame).

Start with top while your application is running. Is the CPU usage around 90%+? If so, then our performance analysis will be with sysprof. If it's not very high, is the GPU staying busy? We don't have a clean tool for this yet, but cat /debug/dri/0/v3d_regs could be useful (on systems where debugfs is mounted at /sys/kernel/debug, the file is /sys/kernel/debug/dri/0/v3d_regs). If CT0CA != CT0EA or CT1CA != CT1EA, that means that the GPU is currently busy processing some rendering job.

sysprof for CPU usage

If the CPU is totally busy and the GPU isn't terribly busy, there is an excellent tool for debugging: sysprof. Install it, run it as root (so you can get system-wide profiling), hit play and later stop. The top-left area shows the flat profile sorted by total time of each symbol plus its descendants. The top few are generally uninteresting (main() and its descendants consuming a lot), but eventually you can get down to something interesting. Click it, and to the right you get the call chains to descendants, showing where all that time actually went. The lower left, on the other hand, shows callers; double-clicking one of those selects it as the symbol to view instead.

Note that you need debug symbols for the call graphs in sysprof to work, which is where most of its value is. Most distributions offer debug symbol packages from their builds which can be installed separately, and sysprof will find them. I've found that on arm, the debug packages are not enough, and if someone could determine what is necessary for call graphs in debugging, that would be really helpful.

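For the GPU-busy check described at the start of Step 2, the register comparison can be sketched in Python. The layout of the v3d_regs file is not described here, so this sketch assumes each relevant line contains a register name ending in CT0CA, CT0EA, CT1CA or CT1EA and finishes with a 0x-prefixed value. The function names and the sample text are made up for illustration.

```python
import re

# Assumed line shape (not taken from the documentation): a register name
# such as "CT0CA", with the register value as the last 0x-prefixed
# number on the same line.
_REG_LINE = re.compile(r"(CT[01](?:CA|EA))\b.*?(0x[0-9a-fA-F]+)\s*$")


def parse_thread_regs(text):
    """Extract CT0CA/CT0EA/CT1CA/CT1EA values from a register dump."""
    regs = {}
    for line in text.splitlines():
        match = _REG_LINE.search(line)
        if match:
            regs[match.group(1)] = int(match.group(2), 16)
    return regs


def gpu_busy(regs):
    """True if CT0CA != CT0EA or CT1CA != CT1EA."""
    return regs["CT0CA"] != regs["CT0EA"] or regs["CT1CA"] != regs["CT1EA"]


def read_regs(path):
    """Read and parse the debugfs file; this normally requires root."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        return parse_thread_regs(handle.read())


if __name__ == "__main__":
    # Made-up sample in the assumed format, so the sketch runs anywhere.
    sample = (
        "CT0CA = 0x00001000\n"
        "CT0EA = 0x00001000\n"
        "CT1CA = 0x00002000\n"
        "CT1EA = 0x00002040\n"
    )
    print(gpu_busy(parse_thread_regs(sample)))
```

A single reading is only a snapshot, so sampling the file repeatedly while the application runs gives a better picture of how often the GPU is busy.
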
perf for CPU waits on GPU

If the CPU is not very busy and the GPU is not very busy, then we're probably ping-ponging between the two. Most cases of this would be noticed by VC4_DEBUG=perf, but not all. To see all cases where this happens, use the perf tool from the Linux kernel (note: unrelated to VC4_DEBUG=perf):

    sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

If you want to see the whole system's stalls for a period of time (very useful!), use the -a flag instead of a particular command name. Just press Ctrl-C when you're done capturing data.

At exit, you'll have perf.data in the current directory. You can print out the results with:

    perf report | less

Debugging for GPU fully busy

As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's performance counters in OpenGL. Install apitrace, and trace your application by appending its command line to one of:

    apitrace trace          # for GLX applications
    apitrace trace -a egl   # for EGL applications

Once you've captured a trace, you can see what counters are available and replay it while looking at some of those counters:

    apitrace replay .trace --list-metrics
    apitrace replay .trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

Multiple counters can be captured at once with commas separating them.

Once you've found what draw calls are surprisingly expensive in one of the counters, you can work out which ones they were at the GL level by opening the trace up in qapitrace and using Ctrl-G to jump to that call number and Ctrl-L to look up the GL state at that call.

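When several counters are of interest, the comma-separated --pdraw argument can be assembled programmatically. The sketch below only builds the argument list and does not run apitrace. The function name and the trace file name are hypothetical, and any further counter names should be taken from the --list-metrics output on your own system.

```python
import shlex


def replay_counters_argv(trace_path, counters,
                         backend="GL_AMD_performance_monitor"):
    """Build an `apitrace replay` argument list that samples the given
    counters per draw call, joining the counter names with commas."""
    if not counters:
        raise ValueError("at least one counter name is required")
    metric_spec = f"{backend}:{','.join(counters)}"
    return ["apitrace", "replay", trace_path, f"--pdraw={metric_spec}"]


if __name__ == "__main__":
    # "my-app.trace" is a hypothetical file name. The single counter is
    # the one named in the text; extend the list with names reported by
    # --list-metrics.
    argv = replay_counters_argv(
        "my-app.trace",
        ["QPU-total-clk-cycles-vertex-coord-shading"],
    )
    print(shlex.join(argv))
```

Passing the returned list to subprocess.run() avoids shell quoting problems if the trace path contains spaces.
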
shader-db

shader-db is often used as a proxy for real-world app performance when working on the compiler in Mesa. On vc4, there is a lot of state-dependent code in the shaders (like blending or vertex attribute format handling), so the typical shader-db will miss important areas for optimization. Instead, anholt wrote a new one based on apitraces. Once you have a collection of traces, starting from traces-db, you can test a compiler change in this shader-db with:

    ./run.py > before
    (cd ../mesa && make install)
    ./run.py > after
    ./report.py before after

Hardware Documentation

For driver developers, Broadcom publicly released a specification PDF for the 21553, which is closely related to the vc4 GPU present in the Raspberry Pi. They also released a snapshot of a corresponding Android graphics driver. That graphics driver was ported to Raspbian for a demo, but was not expected to have ongoing development.

Developers with NDA access with Broadcom or Raspberry Pi can potentially get access to "simpenrose", the C software simulator of the GPU. The Mesa driver includes a backend (vc4_simulator.c) to use simpenrose from an x86 system with the i915 graphics driver, with all of the vc4 rendering commands emulated on simpenrose and memcpyed to the real GPU.


