audio question answering

notes on my plans and progress for an audio question answering pipeline for huggingface

my initial AQA issue

initial inspiration came from looking for an issue to work on and coming across the DocumentQuestionAnswering pipeline PR

huggingface/transformers DocumentQuestionAnswering PR

as the idea is very similar to DQA/VQA just with a new modality, i plan to reference DQA for a lot of this new PR

------------------

TODO

[x] add bare minimum to get the new "audio-question-answering" pipeline to pass with no actual processing applied. looks like it's just adding audio_question_answering.py and updating __init__.py

[x] bare minimum 1 model pass

[x] bare minimum 2 model pass

[x] load audio and apply ASR and pass

[x] above plus apply QA and pass

[ ] write helpful util stuff

[ ] polish everything so it's actually PR-quality

STATUS

to my surprise, it already works end to end: loading both models, applying ASR, then QA
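The working flow is basically just chaining two stages. A minimal sketch, where `answer_from_audio` is a hypothetical helper and the `asr`/`qa` callables would be transformers pipelines (e.g. `pipeline("automatic-speech-recognition")` and `pipeline("question-answering")`):

```python
# Sketch of the ASR -> QA chain. The asr/qa arguments stand in for
# transformers pipelines; this is not the actual pipeline implementation.
def answer_from_audio(audio, question, asr, qa):
    # step 1: transcribe the audio into text
    context = asr(audio)["text"]
    # step 2: run extractive QA over the transcript
    return qa(question=question, context=context)
```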

spending more thought now on how to make it a lot better. one subpar model on its own is tolerable, but chaining two subpar models compounds the errors and the output quality gets bad fast

think i'm going to just clean it up and get it PR-ready, then ask the HF maintainers what they think the model-size limits should be for pipeline defaults

like it would be really good if we could assume people can load both Whisper large-v3-turbo + Llama 3.2 1B, but in the name of accessibility that's probably not the right way to pick defaults

Notes/Updates

------------------update

going to go through the DQA code and come up with a plan for how to implement AQA. it should be very similar, the main difference being applying STT instead of OCR

------------------update

got most of audio_question_answering.py done, but hit a problem while updating the related plumbing: currently every task in SUPPORTED_TASKS (outlined in the pipelines __init__)

defines only 1 pt and/or 1 tf default model, since most pipelines use a single model. AQA relies on 2 models: one for the initial ASR and one for QA.

ended up realizing that DQA, while a 2-step process, only has a single default model (for QA), since Tesseract is used for OCR instead of a vision model

going to see what, if anything, needs to be updated so that SUPPORTED_TASKS can support defining 2 default models.
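One option could look like the sketch below: nest two keys under `default.model` instead of one. This is purely hypothetical, the `"asr"`/`"qa"` keys and model names are my assumptions for illustration, not the actual transformers structure:

```python
# Hypothetical shape for a SUPPORTED_TASKS entry carrying two default models.
# In the real registry, "impl" would be the pipeline class and "pt" the
# AutoModel classes; strings are used here just to keep the sketch runnable.
SUPPORTED_TASKS = {
    "audio-question-answering": {
        "impl": "AudioQuestionAnsweringPipeline",
        "pt": (),
        "default": {
            "model": {
                # one default per stage instead of a single model
                "asr": {"pt": ("openai/whisper-tiny.en", "main")},
                "qa": {"pt": ("distilbert-base-cased-distilled-squad", "main")},
            }
        },
        "type": "multimodal",
    },
}
```

The downside is that any code which assumes `default["model"]` holds a single `{"pt": ..., "tf": ...}` dict would need updating to handle the nested case.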

------------------

okay made some progress but just want to add a note to list all the potential things to update

class transformers.pipelines.QuestionAnsweringArgumentHandler: "QuestionAnsweringPipeline requires the user to provide multiple arguments to be mapped to internal SquadExample"

and "QuestionAnsweringArgumentHandler manages all the possible to create a SquadExample from the command-line supplied arguments"

reading the handler, it expects a dictionary with keys {question: ..., context: ...}. the same idea applies to AQA, so should consider a handler (or something similar) that works with AQA inputs; the current handler only really applies to DQA

currently hitting: "Can't instantiate abstract class AudioQuestionAnsweringPipeline without an implementation for abstract methods _forward, _sanitize_parameters, postprocess, preprocess"
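Clearing that error just means implementing the four methods the Pipeline base class requires. A placeholder skeleton (not subclassing the real `transformers.Pipeline` here, and the bodies are stand-ins, not the actual logic):

```python
class AudioQuestionAnsweringPipeline:  # would subclass transformers.Pipeline
    def _sanitize_parameters(self, **kwargs):
        # split user kwargs into preprocess / forward / postprocess param dicts
        return {}, {}, {}

    def preprocess(self, inputs, **preprocess_params):
        # load the audio and run ASR here to produce a QA context
        return inputs

    def _forward(self, model_inputs, **forward_params):
        # run the QA model over (question, transcript)
        return model_inputs

    def postprocess(self, model_outputs, **postprocess_params):
        # pull out the answer span and score
        return model_outputs
```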

------------------
