notes on my plans and progress for an audio question answering pipeline for huggingface
my initial AQA issue
initial inspiration came from looking for an issue to work on and coming across the DocumentQuestionAnswering pipeline PR
huggingface/transformers DocumentQuestionAnswering PR
as the idea is very similar to DQA/VQA, just with a new modality, i plan to reference DQA for a lot of this new PR
------------------
[x] add bare minimum to get the new "audio-question-answering" pipeline to pass with no actual processing applied. looks like just add audio_question_answering.py and update __init__.py
[x] bare minimum 1 model pass
[x] bare minimum 2 model pass
[x] load audio and apply ASR and pass
[x] above plus apply QA and pass
[ ] write helpful util stuff
[ ] make it good good
to my surprise, it works already. loading both models, applying ASR and then QA
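the two-stage flow so far can be sketched like this. `asr` and `qa` stand in for the transformers `"automatic-speech-recognition"` and `"question-answering"` pipelines; here they're stubbed out so the sketch runs standalone, and the function/stub names are mine, not anything in the library.

```python
# minimal sketch of the two-stage flow: ASR output becomes the QA context
def audio_question_answer(asr, qa, audio, question):
    # stage 1: transcribe the audio into text
    transcript = asr(audio)["text"]
    # stage 2: run extractive QA over the transcript
    return qa(question=question, context=transcript)

# stub "models" for demonstration only
def fake_asr(audio):
    return {"text": "the meeting starts at noon on friday"}

def fake_qa(question, context):
    # a real QA pipeline returns {"score", "start", "end", "answer"}
    return {"answer": "noon", "score": 0.9}

result = audio_question_answer(fake_asr, fake_qa, b"<wav bytes>", "when does the meeting start?")
print(result["answer"])  # → noon
```

with real pipelines, the stubs would just be replaced by `pipeline(...)` objects loaded once up front.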
spending more thought now on how to make it a lot better. 1 subpar model is whatever, but subpar model + subpar model has a compounding effect, and the stacked errors produce garbage answers
think im going to just clean it up and make it ready for PR, then ask the HF people what they think the limits should be on model sizes for pipeline defaults
like it would be really good if we could assume people could load both whisper large-v3 turbo + llama 3.2 1b, but in the name of accessibility that's probably not the right way to pick defaults
------------------update
going to go through the DQA code and come up with a plan for how to implement AQA. it should be very similar with the difference being applying STT instead of OCR
------------------update
got most of audio_question_answering.py done but came across a problem while updating the related plumbing. currently every task in SUPPORTED_TASKS (outlined in the __init__ of pipelines)
only defines 1 pt and/or 1 tf default model, since most pipelines only use a single model, while aqa relies on 2 models: one for the initial asr and then one for qa
ended up realizing that DQA, while a 2 step process, only has a single default model (for qa) since tesseract is used for ocr instead of a vision model
going to see what, if anything, needs to be updated so that SUPPORTED_TASKS can support defining 2 default models.
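one possible shape for a 2-default entry, as a hedged sketch: the real entries in pipelines/__init__.py nest a single default model per framework under "default" → "model", so the "asr"/"qa" split below is my proposed extension, not existing transformers structure. the model ids are just example checkpoints and the impl is a string placeholder so the sketch runs standalone.

```python
# hypothetical SUPPORTED_TASKS entry carrying two default models, one per stage
SUPPORTED_TASKS_SKETCH = {
    "audio-question-answering": {
        "impl": "AudioQuestionAnsweringPipeline",  # class name as a string for the sketch
        "default": {
            "model": {
                # stage 1: speech-to-text default
                "asr": {"pt": ("openai/whisper-tiny", "main")},
                # stage 2: extractive QA default
                "qa": {"pt": ("distilbert-base-cased-distilled-squad", "main")},
            }
        },
        "type": "multimodal",
    }
}
```

the open question is whether the loading code that consumes "default" can tolerate this extra nesting level, or whether a flatter second key (e.g. a sibling of "model") would be less invasive.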
------------------
okay made some progress but just want to add a note to list all the potential things to update
class transformers.pipelines.QuestionAnsweringArgumentHandler: "QuestionAnsweringPipeline requires the user to provide multiple arguments to be mapped to internal SquadExample"
and "QuestionAnsweringArgumentHandler manages all the possible to create a SquadExample from the command-line supplied arguments"
reading the handler, it talks about needing to provide a dictionary with keys {question: ..., context: ...}. this also applies to AQA, so should consider a handler or something similar that works with AQA. the current handler only really applies to DQA
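a rough idea of what an AQA analogue of that handler could look like: normalize the various call shapes into a list of {"question": ..., "audio": ...} dicts before preprocessing. the class and key names here are mine, not transformers API.

```python
# hypothetical argument handler: map user-supplied args to {question, audio} dicts
class AudioQuestionAnsweringArgumentHandler:
    def __call__(self, *args, **kwargs):
        if args and isinstance(args[0], (dict, list)):
            # already a dict (or list of dicts) with the expected keys
            inputs = args[0] if isinstance(args[0], list) else [args[0]]
        elif "question" in kwargs and "audio" in kwargs:
            # keyword form: pipeline(question=..., audio=...)
            inputs = [{"question": kwargs["question"], "audio": kwargs["audio"]}]
        elif len(args) == 2:
            # positional form: pipeline(audio, question)
            inputs = [{"audio": args[0], "question": args[1]}]
        else:
            raise ValueError("could not map arguments to {question, audio}")
        for item in inputs:
            if "question" not in item or "audio" not in item:
                raise KeyError("each input needs 'question' and 'audio' keys")
        return inputs
```

batching then falls out for free: a list of dicts passes straight through the first branch.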
"Can't instantiate abstract class AudioQuestionAnsweringPipeline without an implementation for abstract methods _forward, _sanitize_parameters, postprocess, preprocess"
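for reference, a standalone skeleton of those four methods. the real class would subclass transformers.Pipeline, whose __call__ drives the preprocess → _forward → postprocess chain; here a simplified __call__ stands in for the base class so the sketch runs on its own, and the stubbed asr/qa callables are placeholders, not real models.

```python
# standalone sketch of the four abstract methods the Pipeline base class requires
class AudioQuestionAnsweringPipelineSketch:
    def __init__(self, asr_model, qa_model):
        self.asr_model = asr_model  # callable: audio -> transcript str
        self.qa_model = qa_model    # callable: (question, context) -> list of answer dicts

    def _sanitize_parameters(self, top_k=None, **kwargs):
        # split user kwargs into preprocess / forward / postprocess params
        postprocess_params = {}
        if top_k is not None:
            postprocess_params["top_k"] = top_k
        return {}, {}, postprocess_params

    def preprocess(self, inputs, **preprocess_params):
        # run ASR here so the forward step sees plain QA inputs
        transcript = self.asr_model(inputs["audio"])
        return {"question": inputs["question"], "context": transcript}

    def _forward(self, model_inputs, **forward_params):
        return self.qa_model(model_inputs["question"], model_inputs["context"])

    def postprocess(self, model_outputs, top_k=1):
        # keep only the best top_k answers
        return model_outputs[:top_k]

    def __call__(self, inputs, **kwargs):
        # simplified stand-in for transformers.Pipeline.__call__
        pre, fwd, post = self._sanitize_parameters(**kwargs)
        return self.postprocess(self._forward(self.preprocess(inputs, **pre), **fwd), **post)
```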
------------------