flink-user mailing list archives

Hi Guys,
Can someone please help me understand this?
Regards,
Vinay Patil
On Thu, Apr 27, 2017 at 12:36 PM, Vinay Patil <vinay18.patil@gmail.com>
wrote:
> Hi Guys,
>
> For historical reprocessing, I am reading the Avro data from S3 and
> passing these records through the same pipeline for processing.
>
> I have the following queries:
>
> 1. I am running this pipeline as a streaming application with checkpointing
> enabled. The records are successfully written to S3, but they remain in
> the pending state because checkpointing is not triggered while I am doing
> re-processing. Why does this happen? (I kept the checkpointing interval at 1
> minute, and the pipeline ran for 10 minutes.)
> This is the code I am using for reading the Avro data from S3:
>
> AvroInputFormat<SomeAvroClass> avroInputFormat = new AvroInputFormat<>(
>         new org.apache.flink.core.fs.Path(s3Path),
>         SomeAvroClass.class);
> sourceStream = env.createInput(avroInputFormat).map(...);
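For reference, enabling checkpointing on the job looks roughly like this (a minimal sketch assuming a standard StreamExecutionEnvironment; the 60-second interval matches the 1-minute setting mentioned above, and the class and job names are placeholders):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReprocessingJob {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Trigger a checkpoint every 60 seconds, as described above.
        env.enableCheckpointing(60_000);
        // ... build the Avro-from-S3 pipeline here, then:
        // env.execute("historical re-processing");
    }
}
```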
>
> 2. For the source stream, Flink sets the parallelism to 1, and for the
> rest of the operators the user-specified parallelism is used. How does Flink
> read the data? Does it bring the entire file from S3 one at a time and
> then split it according to the parallelism?
>
> 3. I am reading from two different S3 folders and treating them as
> separate source streams. How does Flink read the data in this case? Does it
> pick one file from each S3 folder, split the data, and pass it downstream?
> Does Flink read the data sequentially? I am confused here because only one
> Task Manager is reading the data from S3, yet all TMs are getting the
> data.
>
> 4. Although I am running this as a streaming application, the operators go
> into the FINISHED state after processing. Is this because Flink treats the S3
> source as finite data? What will happen if data is continuously
> written to S3 from one pipeline while I am doing historical re-processing
> from a second pipeline?
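If the goal in point 4 is to keep the job running while new files arrive in S3, the continuous file monitoring variant of readFile may be relevant (a hedged sketch, not confirmed against the original pipeline: SomeAvroClass and s3Path are the names from the snippet above, the import paths assume a 2017-era Flink 1.x layout, and the 10-second scan interval is an arbitrary assumption):

```java
import org.apache.flink.api.java.io.AvroInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

AvroInputFormat<SomeAvroClass> avroInputFormat =
        new AvroInputFormat<>(new Path(s3Path), SomeAvroClass.class);
// PROCESS_CONTINUOUSLY re-scans the path (here every 10 s), so the source
// keeps running and picks up newly written files instead of reaching
// FINISHED; createInput(...) behaves like PROCESS_ONCE and treats the
// input as finite.
DataStream<SomeAvroClass> sourceStream = env.readFile(
        avroInputFormat, s3Path,
        FileProcessingMode.PROCESS_CONTINUOUSLY, 10_000L);
```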
>
> Regards,
> Vinay Patil
>